multiprocess design

261 views
Skip to first unread message

Javier Guerra Giraldez

unread,
Jun 13, 2015, 4:40:04 AM6/13/15
to snabb...@googlegroups.com
TL;DR: 420Gb/sec with real processing!

ok, now the long and winding road:

the previous tests with multithreading SnabbSwitch had a nasty memory
limit: since all threads share the same memory space, we can't create
hundreds of LuaJIT VMs, because all fight for the same 2GB low
addresses.

turning to a fork() scheme, removed that limitation, and i have
happily ran 160 tasks under the same script. it's much simpler than
the pthread scheme. No need for C code, just call ljsyscall's
S.fork() and go. Almost the only change needed was to move most of
the core.shm calls after the forks, since the new process inherits the
LuaVM already initialized.

with that in place, i repeated the tests that previously gave us that
372Mpps figure. The initial number of packets per process was
somewhat higher than under pthreads, probably because of lower memory
pressure. i guess that having independent memory mappings help to
keep each tasks memory close to the corresponding core.

i looked for the diminishing returns point, but the total number of
packets was higher and higher the more threads i added, up to
675Mpps!!

well, it was obviously well past any reasonable point. and since it
was a simple source->sink transfer, it only tests how quickly can many
process switch between private LuaVMs.

i copied the new-style ipv6-tunnel code from the vm-sandbox thread and
repeated the tests with a source->encapsulate->decapsulate->sink setup
(so each packet passes both processes before being released). of
course it wasn't so high, but still kept rising the more processes i
added. raised the packetsize to 1500 bytes (from 120). got fewer
packets per second, but higher number of Gigabytes per second.

on a graph, it was clear that all tests (no work, tunnel, small
packets, large packets, everything) had a mostly linear growth up to
12 procs and then kept rising at a much lower rate and jaggy lines,
but it was undeniably growing at linear rate. that was just too
fishy.

then i realized that the supposedly "1 sec" benchmark was taking a few
secs in real time... hum. changed to 10 sec and it was totally
different! it seems that some processes were finishing before the
last ones went up, so adding more didn't increase concurrency, just
got a longer queue.

now, with longer running times, every test showed the same linear
growth until 12 procs and then flattened. happily, there wasn't any
'trashing' even at 160 procs, the total just stays the same. i
repeated the test at 60sec, just to be on the safe side, and the
numbers didn't change, so it seems they're legit.

the total mentioned above: 420Gb/sec is for 16 cores
encapsulating/decapsulating packets of 1500 bytes each (35Mpps). So
it seems we can easily attack multiple bidirectional 100Gb/sec ports
with a single machine! (well, we will need some way to split the
driver load between several tasks, and hope that the PCI bus delivers)

with the same seup and very small packets (120 bytes), the number of
packets is higher (up to 180Mpps) but the throughput is lower, maxing
at 175Gb/sec. still, almost saturating two bidirectional 100Gb/sec
channels with tiny packets... nothing to sneeze about, i guess :-)


--
Javier

Luke Gorrie

unread,
Jun 15, 2015, 12:43:32 PM6/15/15
to snabb...@googlegroups.com
Howdy,

On 13 June 2015 at 10:40, Javier Guerra Giraldez <jav...@snabb.co> wrote:
TL;DR: 420Gb/sec with real processing!

Awesomel! :-)

CPUs sure are fast now. 420 Gigabytes per second with in-place encapsulation and decapsulation (memmove) should be more than 6 Tbps (terabits per second) of aggregate load/store work the CPU is doing on packet data. Sure, this is only a small benchmark, but it is nice to see data that supports the theoretical capacity of the CPU and SIMD unit.

I'm estimating that a 36-core Skylake server will have around 42 Tbps of peak load/store capacity to L2 cache (maybe even L3). We are really getting a free ride here on the work that Intel is doing for HPC. I reckon we will be able to give ASIC based appliances a run for their money :-)

turning to a fork() scheme, removed that limitation, and i have
happily ran 160 tasks under the same script.  it's much simpler than
the pthread scheme.  No need for C code, just call ljsyscall's
S.fork() and go.   Almost the only change needed was to move most of
the core.shm calls after the forks, since the new process inherits the
LuaVM already initialized.

This sounds wonderful. Sounds like we should run with the fork() based design. It is fantastic that you have been able to prototype the designs so quickly and sniff out where the trouble is :-)

I suppose this implies that when you create an app network you will explicitly assign each app to a specific process. That sounds attractive actually: then application developers can get hands-on experience with what parallelism works best for their applications. We should make sure that engine.configure() gracefully handles moving apps from one process to another, i.e. stop in the old process then start in the new, so that we can still experiment with dynamic load sharing in the simple way of updating placement of apps within the app network.

I suppose the fork() design also does not require the sandboxes and "new style" apps? It would be a really nice bonus if it would work automatically with the existing apps and not require any rewrites. There will be some behavior of the existing apps that is not optimal with respect to multiple processes, but we could start by simply documentation that, and then work on improving the apps (updating the existing ones or creating new ones).

Or do I miss something crucial here? Have not read the code properly yet...

Cool stuff! :-)

Cheers,
-Luke


Luke Gorrie

unread,
Jun 16, 2015, 4:07:38 AM6/16/15
to snabb...@googlegroups.com
On 15 June 2015 at 18:43, Luke Gorrie <lu...@snabb.co> wrote:
I suppose the fork() design also does not require the sandboxes and "new style" apps?

To expand on that a bit...

There are a couple of reasons the "sandboxed" apps that can't see each other are interesting:

1. If we want the freedom to run app instances in different VMs/processes then it is important that they don't casually depend on each other e.g. by expecting to communicate using shared Lua variables.

2. If we want to run multiple app instances in parallel within the same process then they need to be running in separate Lua VMs (lua_states).

Really there are many different ways we could tackle problem #1: separate VM per app, Lua-level sandbox per app, or simply explain that you should not depend on apps to have a shared environment (or otherwise carefully document the implications of running them in separate processes). This does not seem to be on the critical path for the parallel app network implementation: that will work without sandboxes and separately we can consider whether they are worth adding in their own right.

Problem #2 is also not on the critical path if we are fork()ing worker processes. Then we only have one thread of execution per process and so we only need one Lua VM.


Separately...


There is another issue in the back of my mind which is DMA memory mappings. If one process allocates a new HugeTLB then how do we map that into all the processes? That is, how do we make sure the 'struct packet *' addresses on the inter-app rings are actually pointing to the same valid memory in every worker process?

The simplest way would be to reserve memory at the start before forking and only use that. Probably reasonable. Fancier would be to catch SIGSEGV, recognize access to unmapped DMA memory, and map it in on the fly. That sounds pretty horrendously complicated though. Perhaps there is a middle ground: a simple mechanism that supports dynamic allocation.

There are likely other issues too e.g. what happens if a worker process crashes hard? (Can we reclaim the packet memory it was using? Or should we terminate the whole process tree and restart?) Conceivably it might even be worth making copies of packets when they pass between processes so that memory ownership is very clear. I am not sure.

I should read the code on your fork_poc branch to see how it works for now :). Have been away from the keys for a bit.

Cheers,
-Luke


Luke Gorrie

unread,
Jun 16, 2015, 4:21:41 AM6/16/15
to snabb...@googlegroups.com
also:

Can we please add all of these microbenchmarks to the `snabbmark` program?

SnabbBot is able to track benchmark results over time and detect improvements and regressions. However, this depends on the benchmarks to be well-defined and a part of the upstream code base. Giving them a name and adding them to snabbmark seems like a good solution.

Generally we have built up a bit of "technical debt" with making code changes based on performance measurements that are not reproducible. For example, we disabled the automatic restart of apps because it somehow caused performance degradation in some applications, but we have no well-defined way to test a fix for that and so the feature has been stuck in limbo. Really we need our CI to have excellent test coverage both for functionality and performance.


On 13 June 2015 at 10:40, Javier Guerra Giraldez <jav...@snabb.co> wrote:


--
Javier

--
You received this message because you are subscribed to the Google Groups "Snabb Switch development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to snabb-devel...@googlegroups.com.
To post to this group, send an email to snabb...@googlegroups.com.
Visit this group at http://groups.google.com/group/snabb-devel.

Max Rottenkolber

unread,
Jun 16, 2015, 6:41:50 AM6/16/15
to snabb...@googlegroups.com
On Tue, 16 Jun 2015 10:21:40 +0200, Luke Gorrie wrote:

> Can we please add all of these microbenchmarks to the `snabbmark`
> program?
>
> SnabbBot is able to track benchmark results over time and detect
> improvements and regressions. However, this depends on the benchmarks to
> be well-defined and a part of the upstream code base. Giving them a name
> and adding them to snabbmark seems like a good solution.
>
> Generally we have built up a bit of "technical debt" with making code
> changes based on performance measurements that are not reproducible. For
> example, we disabled the automatic restart of apps because it somehow
> caused performance degradation in some applications, but we have no
> well-defined way to test a fix for that and so the feature has been
> stuck in limbo. Really we need our CI to have excellent test coverage
> both for functionality and performance.

+1 on this.

I can help with this too. You can simply send me your ad-hoc benchmark
code and I will add it as a snabbmark test and take care of any gritty
details that might arise.


Javier Guerra Giraldez

unread,
Jun 16, 2015, 10:04:06 AM6/16/15
to snabb...@googlegroups.com
On 16 June 2015 at 03:07, Luke Gorrie <lu...@snabb.co> wrote:
> On 15 June 2015 at 18:43, Luke Gorrie <lu...@snabb.co> wrote:
>>
>> I suppose the fork() design also does not require the sandboxes and "new
>> style" apps?
>
>
> To expand on that a bit...
>
> There are a couple of reasons the "sandboxed" apps that can't see each other
> are interesting:
>
> 1. If we want the freedom to run app instances in different VMs/processes
> then it is important that they don't casually depend on each other e.g. by
> expecting to communicate using shared Lua variables.
>
> 2. If we want to run multiple app instances in parallel within the same
> process then they need to be running in separate Lua VMs (lua_states).
>
> Really there are many different ways we could tackle problem #1: separate VM
> per app, Lua-level sandbox per app, or simply explain that you should not
> depend on apps to have a shared environment (or otherwise carefully document
> the implications of running them in separate processes). This does not seem
> to be on the critical path for the parallel app network implementation: that
> will work without sandboxes and separately we can consider whether they are
> worth adding in their own right.
>
> Problem #2 is also not on the critical path if we are fork()ing worker
> processes. Then we only have one thread of execution per process and so we
> only need one Lua VM.

nitpicking: the LuaVM itself is fork()ed, so it becomes one Lua per
process, but isn't separately initialized, it inherits all the state
up to the fork() point.

there's some confusing effects when inheriting core.shm objects, it
seems that the mmap()'ed part is shared, but there seem to be some
kernel per-process housekeeping that makes some surprising COW
effects. it's much easier to do the shm.map(name) _after_ the
fork(). the pid()-relative nameform makes it easy to share within a
fork, it would be nice to have a while-program-id too, to make it easy
to share between forks but different between program invocations.

>
>
> Separately...
>
>
> There is another issue in the back of my mind which is DMA memory mappings.
> If one process allocates a new HugeTLB then how do we map that into all the
> processes? That is, how do we make sure the 'struct packet *' addresses on
> the inter-app rings are actually pointing to the same valid memory in every
> worker process?

i still don't have a foolproof way to share packets between forks.
the mentioned benchmarks used a single freelist per process, and all
packets were allocated, processed and released within the same fork.

i have some hopes that it should be possible to allocate the freelist
at startup (before the forks) and keep the pointers consistent and the
DMA mapping shared and non-COW. that should allow the apps to share
pointers within that space. i guess it wouldn't be able to grow after
the fork() point

if that proves too problematic, the only way out i see would be to
work with offsets within the pool instead of pointers, and let each
fork map it at a different address.



> The simplest way would be to reserve memory at the start before forking and
> only use that. Probably reasonable. Fancier would be to catch SIGSEGV,
> recognize access to unmapped DMA memory, and map it in on the fly. That
> sounds pretty horrendously complicated though. Perhaps there is a middle
> ground: a simple mechanism that supports dynamic allocation.

mmap() supports an 'address hint'. maybe we could somehow reserve a
big address space (say, the third terabyte?) and ask mmap() to put
each DMA allocated block at an specific place. what i don't know is
if simply stacking one point right after the previous one would be ok,
or if there's some housekeeping data just after or just before the
mmap()ed block (i'm guessing no, since it favors page-aligned blocks)



> There are likely other issues too e.g. what happens if a worker process
> crashes hard? (Can we reclaim the packet memory it was using? Or should we
> terminate the whole process tree and restart?) Conceivably it might even be
> worth making copies of packets when they pass between processes so that
> memory ownership is very clear. I am not sure.

every non-shared posix resource is automatically released only after
the parent process calls wait() to get the return status. otherwise
it's a zombie process, taking all its resources.

our application-specific resources would be ours to deal with, from
the OS point of view they're shared so won't be released while other
forks are running. i guess since packets don't contain pointers to
other structures, i guess it would be safe to return to the freelist
to reuse; the biggest challenge would be how to identify which packets
were handled by this specific process.

not releasing those packets and restarting only the apps that died
should keep the system running but leaking memory. maybe setting some
'sync points' where a full terminate and restart (resetting the shared
freelist) would be acceptable.


> I should read the code on your fork_poc branch to see how it works for now
> :). Have been away from the keys for a bit.


it's really, really simple, just a 4-5 line spawn() function that does
a fork() and require() a named file. in the test, that file is just a
config.new()/app/link/engine.configure()/run(). the 'main' process
just initializes the shared stats object, spawn()s a lot, wait()s them
all, show stats.



--
Javier

Luke Gorrie

unread,
Jun 16, 2015, 12:26:22 PM6/16/15
to snabb...@googlegroups.com
On 16 June 2015 at 16:04, Javier Guerra Giraldez <jav...@snabb.co> wrote:
there's some confusing effects when inheriting core.shm objects, it
seems that the mmap()'ed part is shared, but there seem to be some
kernel per-process housekeeping that makes some surprising COW
effects.   it's much easier to do the shm.map(name)  _after_ the
fork().

I think it would be worth understanding the details of the surprising copy-on-write effects before we design around it.

If we would have to avoid inheriting shared memory mappings then it might almost be simpler to exec new processes instead of forking the existing one, but it seems like it should not come to that.
 
mmap() supports an 'address hint'.  maybe we could somehow reserve a
big address space (say, the third terabyte?)  and ask mmap() to put
each DMA allocated block at an specific place.

This is done already: check out the comments in memory.c.

DMA memory (e.g. where packets are allocated) is mapped to a predictable virtual address based on its physical address. So we could look at the pointer address and know what physical memory it refers to and somehow map it into the process based on that.

our application-specific resources would be ours to deal with, from
the OS point of view they're shared so won't be released while other
forks are running.  i guess since packets don't contain pointers to
other structures, i guess it would be safe to return to the freelist
to reuse; the biggest challenge would be how to identify which packets
were handled by this specific process.

Right.

The two most simple options I see are:

#1 crash of one process causes crash (restart) of all processes. In this case we have no partial failure from which we need to restore buffers.

#2 each packet belongs to only one process i.e. link.receive() would copy the packet instead of referring to a packet allocated by another process. This is not necessarily inefficient: sharing and copying probably look much the same when you look at the memory bus i.e. the core of the new process loads the packet data from L3 cache into L1/L2 cache.

The second sounds more complex but it would simplify a couple of things:

- Freelists could be per process since packets are not shared. (Packets are returned to the freelist when they have been received by another process i.e. link.read pointer has advanced on a 'struct link' to a different process.)

- Having link.receive() do more work would also be a suitable place to make sure the memory of the source packet is mapped before operating on it.

However it could also be that it is prohibitively expensive in practice.

I don't have an immediate clever idea for supporting partial failures and "garbage collecting" packets that were being processed by workers that crashed but maybe there is a simple way.

> I should read the code on your fork_poc branch to see how it works for now
> :). Have been away from the keys for a bit.

it's really, really simple, just a 4-5 line spawn() function that does
a fork() and require() a named file.  in the test, that file is just a
config.new()/app/link/engine.configure()/run().  the 'main' process
just initializes the shared stats object, spawn()s a lot, wait()s them
all, show stats.

It would be neat to create an absolute minimal branch that supports this benchmark and PR that to show us the details of the design (when you are ready, that is). Then we could even take this discussion on the Github PR where the code is more immediately accessible.

I am digging around on your snabbswitch repo but the branch I see is 1000+ lines of new code, probably because it is building on the previous experiments, and I don't immediately know what is relevant to the fork based design. So the simplest way out of that seems to be to create a new branch (e.g. rebase) containing the essentials only and PR that.


Javier Guerra Giraldez

unread,
Jun 17, 2015, 4:48:53 AM6/17/15
to snabb...@googlegroups.com
On 16 June 2015 at 11:26, Luke Gorrie <lu...@snabb.co> wrote:
> It would be neat to create an absolute minimal branch

done!

https://github.com/javierguerragiraldez/snabbswitch/tree/minimal-fork

do a benchmark with `snabb snsh -t lib.thread.fork`

also exorcised the "surprising COW effects", and unsurprisingly, there
wasn't any COW about them. just that the stats object keeps a
'local' and a 'global' set of counters; the local is referenced by the
pid number; so if the core.app file grabs it before fork()ing, it's
not really 'local'.


--
Javier

Luke Gorrie

unread,
Jun 17, 2015, 3:21:48 PM6/17/15
to snabb...@googlegroups.com
On 17 June 2015 at 10:48, Javier Guerra Giraldez <jav...@snabb.co> wrote:
On 16 June 2015 at 11:26, Luke Gorrie <lu...@snabb.co> wrote:
> It would be neat to create an absolute minimal branch

done!

https://github.com/javierguerragiraldez/snabbswitch/tree/minimal-fork

Great!

Can PR this with '[draft]' in the title any time you want feedback on the code.
 
also exorcised the "surprising COW effects", and unsurprisingly, there
wasn't any COW about them.   just that the stats object keeps a
'local' and a 'global' set of counters; the local is referenced by the
pid number; so if the core.app file grabs it before fork()ing, it's
not really 'local'.

Aha. Yes, I suppose there is potential for a lot of surprises like this.

So what do we need to solve to productionize this code?

The topics that come to mind are:

- Send packets over inter-process links
- Free packets without leaking (shared freelist(s)?)
- Have engine.configure() automatically start/stop child processes as needed
- Have child processes run only the apps that are assigned to them

... more?


Javier Guerra Giraldez

unread,
Jun 18, 2015, 2:25:04 AM6/18/15
to snabb...@googlegroups.com
On 17 June 2015 at 14:21, Luke Gorrie <lu...@snabb.co> wrote:
> Aha. Yes, I suppose there is potential for a lot of surprises like this.
>
> So what do we need to solve to productionize this code?

i guess the best use of core.shm is not to 'contain' existing FFI
objects, but within object constructors, to help those objects
implement each desired level of sharing/isolation.

>
> The topics that come to mind are:
>
> - Send packets over inter-process links

if we have per-process freelists, then i see two possible
inter-process packet accounting:

- copying packet data. the copy is allocated from the destination
process' freelist. the 'original' packet can be returned to its
freelist.

- copy only packet pointers. this would deplete the source process'
freelist. maybe we could 'swap' packet pointers between processes?
to send a packet from process A to process B, we copy the packet
pointer from A to B and a free packet pointer from B to A, so the
number of packets held by each process remains unchanged. IOW: the
'send' method has the packet pointer as an argument and returns a free
packet pointer, while the 'receive' method takes a free packet pointer
and returns a 'full' packet pointer.

> - Free packets without leaking (shared freelist(s)?)

a freelist is not the hardest lock-free structure out there (it can be
done with two LIFO stacks, i have a couple prototypes around), but
debugging those is a serious chore. and there's nothing faster than
non-shared structures.

i guess it would be best to use shm to hold per-process freelists, and
when a process dies, the 'master' process reapes the remains. maybe
returning packets into a global freelist that is used when
initializing new private freelists, in preference to allocating new
DMA chunks

not sure how to catch packets that were in flight... maybe from the
links structures?


> - Have engine.configure() automatically start/stop child processes as needed
> - Have child processes run only the apps that are assigned to them

this looks relatively simple: add a `config.proc()` call analogous to
the current `config.app()` and `config.link()` that simply keeps a
list of process names. in app.configure() do a fork() for each entry
and call apply_config_actions() within the subprocess, but only start
those apps that hold this process name.

in my code drafts i couldn't make that work because i assumed we would
need one freelist for each connected subgraph, then tried to hack a
simple recursive mark-and-sweep graph clustering......


i'll rework the benchmark-like test code to try packet transfer with
per-process freelists. i think the same API is usable for both
copy-data and swap-pointers ideas, let's compare both and see if
there's a significant advantage of either.



--
Javier
Reply all
Reply to author
Forward
0 new messages