Parallel or not parallel

Giancarlo Niccolai

unread,

Jan 21, 2013, 3:22:56 PM1/21/13

to falc...@googlegroups.com

Hey ppl,

I need an informed opinion from the community on how to clear a problem, and I don't want to have just one mind set on this. I need more brains on this problem, so, if you can think on this and possibly spread the challenge, you'll be doing some great favor to all the future users of our language.

Here I am reporting the latest results of the developments of the new engine, now closing in for alpha 1.0 release: I have finally coded in the item locking system that allows to concurrently access items from different parallel agents.

In the following sample:

http://ideone.com/m9Foov

you can see two agents concurrently writing on the t.a property; one is writing a number, the other one is writing a string. A third agent is reading the result, and it prints them out as fast as it can. As the output stream is, well, a stream, we all understand that the actual output rate is limited by the O/S giving us write access to it.

The experiment shows that the check agent reads randomly a string or an integer in the t.a property; the point of the experiment is that the read is clean, and we never have a partial data in the item (that is, we never see the item with a type ID of an integer, but the data of a String).

The agents run in the test parallel engine, that is currently configured to run on 4 processors. On a 4CPU machine, you'll see the two generator agents running at full speed; on a 2CPU machine, one of the agents will be preempted by the check agent, that will then block in order to write the data on I/O. In future, we can introduce some schedule optimizations so that agents are swapped out of VM processors during I/O times, and calculating agents can progress in the meanwhile (currently we DO use this strategy in our 0.9.x Garbage Collection system), but for now, consider this test really showing the full potential of this technique on a 4CPU machine only.

This script runs "start to stop" in approx 3.5 seconds on my test machine (an old HP Intel 2.4Gx4 32 bit). This includes a lot of things we don't really want to measure now, as the compiler, module serializer etc.

These are the time results on an average run:
real    0m3.416s
user    0m5.020s
sys    0m0.160s

As we don't have any GC in place now (that is anyhow parallel and should not impact negatively on the performance), the script generates some 300M strings in memory; this heavily stresses the system memory manager, and it's likely that using a gc as good as that one we have in 0.9 version, this time could go down of some 20%.

A Lua script doing roughly something equivalent is the following:

http://ideone.com/7d8bEX

real    0m2.112s
user    0m2.108s
sys    0m0.000s

Notice that the first loop, the one dealing just with numbers, just takes 0.3s thanks to an excellent optimization of the Lua engine, something we can have as well, so it should be probably taken out of this equation for the purpose of the opinion I am going to ask now.

The performance different is not so impressing when you use two string generating loops in both the scripts.

(notice also that increasing the print rate by reducing the %1000 to %100, or %50 on some systems, just kills the lua version).

Ok, now you have all the parameters to help me out in this decision.

As it is now, the engine looks like some 50% slower than Lua on serial problems; OTOH, on parallel problems it can easily beat LUA, at times even by a great deal thanks to real parallelism in I/O and calculus. On real world situations, exp in high order math done with math libraries, we should have a LONG leading edge. OTOH, on plain script-calculus intensive single-streamlined loops, the engine is roughly 25% to 75% slower than Lua.

The point is that removing the item lock feature, that is, the feature enabling concurrently visible read-writes of items, the engine is consistently 50% faster. Yep, making it at par with Lua also on serial loops. And, we have no optimization in place and there are MANY extra things in the engine that can get faster in future releases. In other words, without the parallel support, we could really be playing at a par with the fastest scripting language EVEN in its own field, and EVEN if our primary concern is NOT the raw speed of the VM loop (but OTHER kind of speed performance, as i.e. the performance of the application code sharing data with our engine).

We have mainly two choices, and possibly a third:
1) Creating two versions of the engine, one with Parallel() stuff turned off, would enable programmers that need a parallel environment (i.e. in robotics) a solid platform on which to work on, while giving general scripters (i.e. simple game AIs) more raw power.
2) OTOH, forcing parallel on would require third party module writers to ALWAYS take the parallelization of the engine into account, and write threadsafe modules. IF we do not, we might end up having a bunch of modules that wouldn't simply be working in the parallel version.
3) The third option would be that of forfeiting the visibility rule that allows to see "a=1" consistently across all agents on shared variables. As we do now with the Threading module, each agent would have a complete copy of its own memory, and there would be streamlined communication channels to send stuff across agents.

Notice that even in case of 1, the engine stays intrinsically MT, and as such it can be proficiently used in MT applications, and used asynchronously, even if the script themselves wouldn't be able to use this asynchronism.

As a side note, I already took several steps to experiment with item locking done only when strictly needed. For instance, I tested a verson where item locking is done in case of property access ONLY (the case used in the sample script), and times are roughly the same, proving that at least 95% of the 50% performance drop when using item locking is due to the very read-write concurrency on the accessed properties.

I also cherished the possibility to start using item locking only after a Parallel.launch() command has been issued, but I haven't found any reasonable way to really do that efficiently enough atm.

So, if any brilliant mind is listening and willing to take the challenge, I am eager to have your POV and suggestions. And if you can, also coding hands.

Bests,
Giancarlo.

Klaim - Joël Lamotte

unread,

Jan 21, 2013, 4:24:58 PM1/21/13

to falc...@googlegroups.com

Keep in mind I'm not yet a guru in parallel code, even if my current game engine is heavily parallelized.

In my use of Falcon (a game), which might be different from other's use case, I embedd falcon in several different systems as scripting language.

I'm not sure exactly if the parallel execution inside the scripts would benefit me, but I assume it could as I will let the player script some parts the way he wants.

(which is another subject)

So, solution 1) seems like a good solution at first sight but I think it makes things complicated for implementers and people distributing falcon programs I guess.

I don't know how to design such language implementation with a runtime condition to turn paralell protections on or off. I think it's impossible to avoid costs in all cases with such a solution.

The solution 3 seems reasonable to me at first, but I think it would require mesurement to see if you don't loose too much on space and speed.

My C++ reflex comes in and ask: would it be possible to mark an objet as thread-safe instead of making everything thread-safe? Would it help?

Because the problem is basically how to make the user pay only if he use parallelism, but still exploit parallelism fully when used?

So, letting the user specify the code that needs protection from parallel use seems to me the most correct way to deal with it.

However I don't know if it have to be implemented efficiently.

Joel Lamotte

Giancarlo Niccolai

unread,

Jan 21, 2013, 4:33:51 PM1/21/13

to falc...@googlegroups.com

Il 21/01/2013 22:24, Klaim - Jo�l Lamotte ha scritto:
>
> My C++ reflex comes in and ask: would it be possible to mark an objet
> as thread-safe instead of making everything thread-safe? Would it help?

You know, talking in the channel I just stumbled across an idea.

The parallel model of the new engine REQUIRES a script to start parallel
execution via the Parallel class (and the invoker MUST wait for complete
execution of the context group that is then created).

In other words, the machine is either in single script or parallel
script mode.

Currently, the code I used is something like:

void Item::assign( Item& source ) {
source.lock();
copy source
source.unlock();

this.lock():
copy from the temp item
this.unlock();
}

Using a flag set by Parallel.launch() would be something like

void Item::assign( Item& source ) {
if( vm->m_isParallel ) {
source.lock();
copy source
source.unlock();

this.lock():
copy from the temp item
this.unlock();
}
else {
just copy;
}
}

I don't know how much time that extra isParallel is going to take, but
it's worth a try. If it just bogs us down of, say, 2-5%, then it's a big
win with respect to 50%.

The same flag could also be used for other parallel-sensible operation,
like merging strings; in a parallel world, I have to respect the same
source lock, source copy, target lock scheme, which is WAY less
efficient than a direct copy you can do in a single threaded world.

The only problem is that I have no VM in sight in the item API. I should
somehow pass it around, or take it from the thread local stack (but
that's highly inefficient).

Gian.

Klaim - Joël Lamotte

unread,

Jan 21, 2013, 4:41:04 PM1/21/13

to falc...@googlegroups.com

On Mon, Jan 21, 2013 at 10:33 PM, Giancarlo Niccolai <g...@niccolai.cc> wrote:

I don't know how much time that extra isParallel is going to take, but
it's worth a try. If it just bogs us down of, say, 2-5%, then it's a big
win with respect to 50%.

The same flag could also be used for other parallel-sensible operation,
like merging strings; in a parallel world, I have to respect the same
source lock, source copy, target lock scheme, which is WAY less
efficient than a direct copy you can do in a single threaded world.

Could you clarify if the flag would be set for the VM or the script (what would be it's object name? instruction queue or something? sorry didn't touch Falcon for a while)

or a part of the VM?

Joel Lamotte

Giancarlo Niccolai

unread,

Jan 21, 2013, 5:57:40 PM1/21/13

to falc...@googlegroups.com

For the VM, or actually, for the context. The main context cannot run in parallel, so, if your host context is the main one, you're sure to run freely.

In the meanwhile, we did a bit of experiments.

First of all, the read of 50% degrade in performance was overestimated. I kept using the MT version, and when I added the item locking, the checker thread started to contend the items to the updater like mad. So, there was a great deal of contention in that code which went totally unnoticed to me.

Using a single threaded looper version, I noticed that the real impact of the item locking code on single threaded scripts is around 10% (provided they're script-computation intensive. The percentage drops when you do I/O or library based ops, of course).

We tested the if( vmflag) approach, and noticed that the impact is about the same. Thinking of it, it's just logical, as the lock code is...

   void assignFromLocal( const Item& other )
   {
      other.copied(true);
      lock();
      copy(other);
      unlock();
   }

   void lock() const {
      while(!atomicCAS(lockId, 0, 1)) {}
   }

   void unlock() const {
      atomicSet(lockId, 0);
   }

The copy() operation itself is a few cpu cycles.

The CAS operation is just slighly less efficient than an IF performed on a cache-remote variable, as they both need to flush the cache; and the VM (or the context) is surely far from the cache.

The atomic set opration overhead is negligible, as it's done in cache and just sets it as dirty; subsequent operations in the VM are very likey to dump it anyhow, so it's net effect is near to 0.

In other words, the game is not worth the candle. The if(parallel) { lock(), copy(), unlock() } version just adds an extra cache dump + cmp/jmp on the parallel branch.

10% degrade -at worst, possibly less in real world programs - on the ST part is not negligible, but bearable, so if nothing comes -- fast -- against it, I am just keeping the locking code.

If someone has an idea on how to make it better, I am all ears.

Gian.

Giancarlo Niccolai

unread,

Jan 21, 2013, 6:33:29 PM1/21/13

to falc...@googlegroups.com

To give the challengers some other information:

Items known to be in the local stack never need to be interlocked, as they are private to the VMContext in control.

Then we have an assignFromLocal operation which interlocks the assignee, leaving the assignand free of lock. This is meant to store items known to be either in the local stack or in the C stack (i.e. temporary items created locally) onto items whose location is uncertain.

Then the full assign operation locks the source, copies it into a local C stack item, and releases it; then it locks the assignee, stores the copied item into it and releases the assignee.

For how this operation seems heavy, I have tested it by just using the assignee lock, as if it was a simple assignFromLocal, and the performance didn't change sensibly. It seems that once the cache is hit, the other operations are efficiently streamlined.

Local variables are currently dynamically resolved; a local small map of Symbol->Item* is formed, so, even if the symbol resolves in a local variable (a variable in the local VMContext stack), it's not possible to know that unless we add some check (and that would bring us back to the uselessness of the if(parallel) check). So, the non-interlocked assign solutions can be employed if and only if it can be demonstrated that one or both the items involved are either in a VMContext local stack or temporarily stored in a C stack.

Gian.

iMe

unread,

Feb 6, 2013, 4:19:29 AM2/6/13

to falc...@googlegroups.com

Hi,

Isn't it possible to go for solution1, BUT, combine both versions into one single binary.

In that case, the only impact would be a larger binary meaning a larger footprint.

But it would postpone the need of solving this right now.

Maybe later on, when things evolves, we can go back to only one version.

It would be nice though if you could manually in the script decide which version you would like to use (parallel enabler/disabler).

You would then of course not be able to switch during execution.

The big advantage is that we will only have one binary, one installation that can execute all falcon programs.

The only disadvantage I can see is that we will have a lager binary to distribute.

Maybe we could even solve that the not used version is not loaded into ram, and thereby not increasing the memory footprint of Falcon.

BR

iMe

Giancarlo Niccolai

unread,

Feb 7, 2013, 9:30:37 AM2/7/13

to falc...@googlegroups.com

Il 06/02/2013 10:19, iMe ha scritto:
>
> But it would postpone the need of solving this right now.
> Maybe later on, when things evolves, we can go back to only one version.
>

I think this is a wise advice.

Also, in the meanwhile we have experimented more (also thanks to the
help of Paul), and discovered that the initial figure of a 50% was
highly overestimated because of the way the test explicitly caused
contention: even when using a single thread to do the loop, the second
thread print loop was still over contending the resource, causing a
delay that wasn't due to pure computation in the other thread.

The final figure is about 10% slowdown with lock/unlock, which is not
negligible, but bearable.

Since that, I have removed a "caution" lock that I added to every "="
Item class C++ operator overload , which are the vast majority in the
engine (the assign() doing double lock with copy and explicitly
one-sided lock copies are just around in the number of 10 or less ), and
that should shape down the figure yet, but I didn't retest atm (and
there may be some place where that leads to crash and requires an
explicit lock, will have to check). However, with that shaped down, the
difference should be something about 5% or less.

Also, in the meanwhile I removed the "Autoexpression" statement and
simplified the tree model. Let me explain:

There are 3 categories of treestep (the PSteps, or atomic VM operations,
which directly represent script constructs):
- Expressions
- Statements
- Syntactic Trees.

The difference between expressions and statements is that an expression
were BOUND to create exactly one value out of its arity; that is, it
MUST take N elements from the stack and then ADD exactly one.

Statements and trees were bound to LEAVE THE STACK as they found it. So,
for instance, the while statement removed the extra element its
expression left on the stack (the true/false item that controls the loop).

Syntactic trees were basically vectors of statements.

The autoexpression was a statement composed of exactly one expression;
so, for instance, this while loop:

while a < 10
printl(a)
a++
end

had a syntree per child, composed of two autoexpression statements, one
being the call expression and the other being the a++

This was fair and nice, but to cut the long story short, I discovered
that it was a bad idea when it comes to dynamic code. You'd like to add
trees to trees, or if you simply have an expression in your hands and
want to add it to a tree, you don't want to have to wrap it in an
autoexpression, and if you have an autoexpression statement around, you
might want to have its expression set directly as a while selector... in
short, it was a hell, not for the engine code itself, but for the
scripters to use the dynamic code approaches.

So, I changed the approach and removed the autoexpression. Now EVERY
TREESTEP is bound to leave a value on the stack, as if it was an
expression; removing it is a duty left to its parent TreeStep, or to the
topmost process code when the process terminates.

This allows to set everything everywhere. For instance, you could set a
treestep as the while selector. Since there is a "exit from step with
value" operator, this can be actually meaningful.

As a result, the ubiquitous extra and non light autoexpression statement
disappeared. This alone brought an improvement of about 15% performance
on raw loops; so, the 10% figure caused by the MT lock is
overcompensated. Not that removing it completely makes no sense, but at
least it looks less of a stretch now.

As a sidenote, the removal of autoexpression is a great help in rules,
which are bound to intercept the result of expressions in their code.
Formerly, autoexpression had to be configured differently if it was in a
interactive compiler or in a rule to yield the exit value of its
expression somewhere (that somewhere was the old "A" register), but now
the rule can just peek the expression value left on the stack, checking
it before discarding it. This should make rules much more easier and
faster to run.

Gian.

Reply all

Reply to author

Forward