Calling C++ from JavaScript: how to modify an input argument?

SimonHF

Apr 22, 2014, 12:54:16 PM4/22/14
to v8-u...@googlegroups.com
For example, I can get a uint like this in a C++ function: uint32_t myuint32 = args[0]->Int32Value();

But is it also possible to change the value somehow from C++ land, so that in JavaScript the variable passed into the function will reflect the changed value?

If this is possible with some C++ argument types and not others, then which types allow modification?

Thanks.

Andreas Rossberg

Apr 23, 2014, 5:28:24 AM4/23/14
to v8-u...@googlegroups.com
On 22 April 2014 18:54, SimonHF <sim...@gmail.com> wrote:
> For example, I can get a uint like this in a C++ function: uint32_t myuint32
> = args[0]->Int32Value();
>
> But is it also possible to change the value somehow from C++ land, so that
> in javascript the variable passed into the function will reflect the changed
> value?

You don't pass "variables" in JavaScript; you pass values.
Consequently, you cannot mutate arguments the way you suggest. (If
those values happen to be mutable objects, then you can of course
mutate those, but that has nothing to do with parameter semantics.)
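
For example (a minimal sketch assuming a 2014-era v8 API with FunctionCallbackInfo; the names "Bump" and "n" are illustrative), an addon can mutate a property of an object argument, and the caller will see the change, because both sides reference the same object value:

#include <v8.h>
using namespace v8;

void Bump(const FunctionCallbackInfo<Value>& args) {
  Isolate* isolate = args.GetIsolate();
  if (args[0]->IsObject()) {
    Local<Object> obj = args[0].As<Object>();
    // Read obj.n, increment it, and write it back in place.
    uint32_t n = obj->Get(String::NewFromUtf8(isolate, "n"))->Uint32Value();
    obj->Set(String::NewFromUtf8(isolate, "n"),
             Integer::NewFromUnsigned(isolate, n + 1));
  }
}

// In JS: var o = { n: 41 }; bump(o);  // o.n is now 42 -- but a plain
// number passed as args[0] could not be changed this way.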

/Andreas

mog...@syntheticsemantics.com

Apr 23, 2014, 1:03:00 PM4/23/14
to v8-u...@googlegroups.com
Simon,

The rationale behind Andreas's answer is that v8 implements a virtual machine, and by definition the only way to move data into or out of it is copy-in/copy-out through a v8 interface.  Using a native plug-in to defeat the isolation of a v8 isolate will only break design assumptions in v8.

An off-heap buffer can be allocated and accessed from inside v8, but referencing that memory from within a JS program requires buffer access methods (see the Node.js v0.10.26 Buffer documentation), which limits you to scalar types.  In practice these operations end up copying the data from the buffer onto the v8 heap anyway; true zero-copy in v8 is nearly impossible.
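
One way to picture this (a minimal sketch against the 2014-era v8 API; the function name is illustrative, and the block must outlive the ArrayBuffer because v8 does not own it):

#include <v8.h>
using namespace v8;

// Wrap an externally allocated block in an externalized ArrayBuffer.
// A JS program can then read it through a typed-array view, i.e. one
// scalar value at a time.
void GetBlock(const FunctionCallbackInfo<Value>& args) {
  static double block[512];  // off-heap storage, never freed
  Local<ArrayBuffer> ab =
      ArrayBuffer::New(args.GetIsolate(), block, sizeof(block));
  args.GetReturnValue().Set(ab);
}

// In JS: var v = new Float64Array(getBlock()); v[0] reads the C++ memory.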

I wrote a native Node addon (https://www.npmjs.org/package/ems) that combines synchronization primitives with shared memory.  It also depends on copy-in/out, and because it's a native plugin it deoptimizes code that uses it.  Nevertheless, it's still capable of millions of atomic updates per second, far better than is possible with messaging.

             -J

Simon

Apr 23, 2014, 1:55:02 PM4/23/14
to v8-u...@googlegroups.com
Thanks for the info and the link. Looks very interesting. I will definitely take a look at ems.

FYI here's what I have discovered so far: 

I created a native Node addon consisting of a function that does nothing. If JavaScript calls the vanilla function as quickly as possible, it manages about 3 million calls per second. I guess this is the high-water mark.

If I modify the function so that it returns a string (which has to be created, with the string bytes copied into the new string object), the calls per second drop substantially, depending upon the length of the returned string.

A way around this is the String::NewExternal() mechanism, which provides a way to make an immutable external string inside v8.
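
For reference, that mechanism looks roughly like this (a minimal sketch assuming the 2014-era v8 API, where the resource type is String::ExternalAsciiStringResource; the bytes must stay alive and unchanged for as long as v8 holds the string):

#include <v8.h>
using namespace v8;

// v8 references our bytes in place instead of copying them onto its heap.
class StaticResource : public String::ExternalAsciiStringResource {
 public:
  StaticResource(const char* data, size_t length)
      : data_(data), length_(length) {}
  const char* data() const { return data_; }
  size_t length() const { return length_; }
 private:
  const char* data_;
  size_t length_;
};

void GetPayload(const FunctionCallbackInfo<Value>& args) {
  static char payload[4096];  // filled elsewhere, never freed
  args.GetReturnValue().Set(String::NewExternal(
      args.GetIsolate(), new StaticResource(payload, sizeof(payload))));
}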

So far I have not managed to get Buffer to give the same kind of performance as String::NewExternal(). Performance seems to be about a third as good :-( Still experimenting.

I'm also on the lookout for mutable objects, as Andreas suggested...

Thanks,
Simon



SimonHF

Apr 23, 2014, 3:53:55 PM4/23/14
to v8-u...@googlegroups.com
FYI here are some perf results that I got calling different types of dummy C++ functions:

* estimate 25000322 calls per second; object: n/a, input: n/a, output: n/a
* estimate 20000019 calls per second; object: unwrapped, input: n/a, output: n/a
* estimate 13333240 calls per second; object: unwrapped, input: 3 ints, output: n/a
* estimate 10000010 calls per second; object: unwrapped, input: 3 ints, output: int
* estimate 7142827 calls per second; object: unwrapped, input: 3 ints, output: 8 byte str
* estimate 1428573 calls per second; object: unwrapped, input: 3 ints, output: 4KB str
* estimate 5405379 calls per second; object: unwrapped, input: 3 ints, output: 4KB str external
* estimate 338983 calls per second; object: unwrapped, input: 3 ints, output: 4KB buffer
* estimate 555556 calls per second; object: unwrapped, input: 3 ints, output: 4KB buffer external

So a dummy C++ function with no input, no output, and no object unwrapping can be called about 25M times per second on my laptop. The same function, once it unwraps its object, can only be called 20M times per second. Add 3 input parameters and it can only be called 13.3M times per second... etc.

Then comes the interesting bit (for me anyway): if the function returns a 4KB string, the calls per second drop to 1.4M. However, using the String::NewExternal() method results in a much better -- as expected -- count of 5.4M per second. The disappointing figures are with node's Buffer::New(): only 339K calls per second for the non-zero-copy method, and only 555K calls per second for the zero-copy version; about 10x slower than the String::NewExternal() method.
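
For reference, the timed functions have roughly this shape (a hypothetical sketch; MyObject and the exact API names are illustrative, assuming node's ObjectWrap and a 2014-era v8 API):

#include <node.h>
#include <v8.h>

class MyObject : public node::ObjectWrap {};

void Dummy(const v8::FunctionCallbackInfo<v8::Value>& args) {
  // "unwrapped": recover the C++ object behind the JS receiver
  MyObject* self = node::ObjectWrap::Unwrap<MyObject>(args.Holder());
  (void)self;
  // "input: 3 ints": unbox three JS numbers
  int a = args[0]->Int32Value();
  int b = args[1]->Int32Value();
  int c = args[2]->Int32Value();
  // "output: int": box a result back into a JS value
  args.GetReturnValue().Set(v8::Integer::New(args.GetIsolate(), a + b + c));
}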

Why is Buffer::New() so slow...?

mog...@syntheticsemantics.com

Apr 23, 2014, 4:18:17 PM4/23/14
to v8-u...@googlegroups.com
Simon,

A month ago I ran similar experiments and got results on the order of what you measured.  Two notes about this type of synthetic benchmark:

1. Use a high-resolution timer (e.g.: npm install microtime).  These results have suspiciously round rates, of the kind you get by dividing 1000000 operations by a small integer.

2. Try a set of experiments that sweeps through a range of iteration counts (e.g.: powers of two from 1 to 1M).  After some number of iterations (about 32k in my experiments) your code is recompiled by Crankshaft, which has completely different execution characteristics for both JS and native addons (native calls are treated as deoptimizations by Crankshaft).  Your results mix the output of the two compilers.

          -J




Simon

Apr 23, 2014, 5:22:14 PM4/23/14
to v8-u...@googlegroups.com
Thanks for the timing tips. In these tests I'm only interested in ball-park figures for the big picture, but if I need a more accurate timer I'll definitely keep microtime in mind. I am very interested in exploring what you said about Crankshaft. Do you have some example code showing this effect? Thanks, Simon

mog...@syntheticsemantics.com

Apr 23, 2014, 5:43:39 PM4/23/14
to v8-u...@googlegroups.com
I compared the performance of calling a non-inlinable library function (sin from libm), which can be reached from JS either as Math.sin() or through my native addon, which calls libm's sin() directly.  I assume Node is using the same math library, so I'm really measuring the difference between external native calls that can be optimized by Crankshaft and external native calls that require copy-in/out of arguments and results.  The JS jig looked like this:

var nOps = 1
var totalOps = 100000000
var microtime = require('microtime');

function rightJustify(strArg, nChars) {
    // left-pad strArg, then keep the last nChars characters
    var str = '                    ' + strArg
    return str.substr(str.length - nChars)
}


var sum = 0
// workfun is the function under test; this placeholder makes the jig run
// stand-alone (the real runs summed Math.sin() or the addon's sin()):
function workfun() { sum += Math.sin(sum) }

while(nOps <= totalOps) {
    var startTime = microtime.now()
    for(var i = 0;  i < nOps;  i++) {
        workfun()
    }
    var opsPerSec = (nOps * 1000000) / (microtime.now()-startTime)
    console.log(rightJustify(nOps,10) + " workfun operations performed at " + 
                rightJustify(Math.floor(opsPerSec), 10) + " ops/sec")   
    nOps *= 2
}
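
The addon side is, roughly, a thin wrapper around libm (a hypothetical sketch of the addon's function, assuming the 2014-era FunctionCallbackInfo API; as discussed below, the cost difference against Math.sin() is in the boxing/unboxing scaffolding, not in sin() itself):

#include <cmath>
#include <v8.h>
using namespace v8;

void AddonSin(const FunctionCallbackInfo<Value>& args) {
  // copy-in: unbox the JS number; call libm; copy-out: box the result
  args.GetReturnValue().Set(
      Number::New(args.GetIsolate(), sin(args[0]->NumberValue())));
}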


Results looked like this:
Math.sin sum:           1 operations performed at        827 ops/sec
Addon sin sum:          1 operations performed at       9803 ops/sec
Math.sin sum:           2 operations performed at     200000 ops/sec
Addon sin sum:          2 operations performed at     105263 ops/sec
Math.sin sum:           4 operations performed at    4000000 ops/sec
Addon sin sum:          4 operations performed at    1333333 ops/sec
Math.sin sum:           8 operations performed at    8000000 ops/sec
Addon sin sum:          8 operations performed at    4000000 ops/sec
Math.sin sum:          16 operations performed at   16000000 ops/sec
Addon sin sum:         16 operations performed at    4000000 ops/sec
Math.sin sum:          32 operations performed at    3555555 ops/sec
Addon sin sum:         32 operations performed at    4571428 ops/sec
Math.sin sum:          64 operations performed at   21333333 ops/sec
Addon sin sum:         64 operations performed at    3764705 ops/sec
Math.sin sum:         128 operations performed at   14222222 ops/sec
Addon sin sum:        128 operations performed at     907801 ops/sec
Math.sin sum:         256 operations performed at   17066666 ops/sec
Addon sin sum:        256 operations performed at     733524 ops/sec
Math.sin sum:         512 operations performed at   15515151 ops/sec
Addon sin sum:        512 operations performed at    5019607 ops/sec
Math.sin sum:        1024 operations performed at   12190476 ops/sec
Addon sin sum:       1024 operations performed at    5389473 ops/sec
Math.sin sum:        2048 operations performed at   13562913 ops/sec
Addon sin sum:       2048 operations performed at    5251282 ops/sec
Math.sin sum:        4096 operations performed at   13791245 ops/sec
Addon sin sum:       4096 operations performed at    3230283 ops/sec
Math.sin sum:        8192 operations performed at   12226865 ops/sec
Addon sin sum:       8192 operations performed at    4571428 ops/sec
Math.sin sum:       16384 operations performed at   12064801 ops/sec
Addon sin sum:      16384 operations performed at    4571428 ops/sec
Math.sin sum:       32768 operations performed at   17645665 ops/sec
Addon sin sum:      32768 operations performed at    5759887 ops/sec
The re-compilation occurs, and the actual overhead of calling C from JS becomes apparent:
Math.sin sum:       65536 operations performed at   22028907 ops/sec
Addon sin sum:      65536 operations performed at    5907869 ops/sec
Math.sin sum:      131072 operations performed at   21962466 ops/sec
Addon sin sum:     131072 operations performed at    5938114 ops/sec
Math.sin sum:      262144 operations performed at   21907404 ops/sec
Addon sin sum:     262144 operations performed at    5915869 ops/sec
Math.sin sum:      524288 operations performed at   22024280 ops/sec
Addon sin sum:     524288 operations performed at    5932336 ops/sec


        -J

mog...@syntheticsemantics.com

Apr 23, 2014, 5:52:12 PM4/23/14
to v8-u...@googlegroups.com
I should point out this experiment came about when I was trying to replicate the results at https://kkaefer.com/node-cpp-modules/#benchmark-thread-pool (discussed further below).

In his case, the entire work function is optimized away by Crankshaft in a very obvious way.  The experiment compares his loop body to a no-op.

              -J


Work Function: Math.floor(133.7 / Math.PI)

         1 workfun operations performed at       7092 ops/sec
         1 no-ops performed at                 333333 ops/sec
         2 workfun operations performed at     142857 ops/sec
         2 no-ops performed at               Infinity ops/sec
         4 workfun operations performed at   Infinity ops/sec
         4 no-ops performed at               Infinity ops/sec
         8 workfun operations performed at    8000000 ops/sec
         8 no-ops performed at               Infinity ops/sec
        16 workfun operations performed at   16000000 ops/sec
        16 no-ops performed at               Infinity ops/sec
        32 workfun operations performed at   16000000 ops/sec
        32 no-ops performed at               Infinity ops/sec
        64 workfun operations performed at    9142857 ops/sec
        64 no-ops performed at               Infinity ops/sec
       128 workfun operations performed at     405063 ops/sec
       128 no-ops performed at               Infinity ops/sec
       256 workfun operations performed at    1855072 ops/sec
       256 no-ops performed at              256000000 ops/sec
       512 workfun operations performed at   64000000 ops/sec
       512 no-ops performed at              256000000 ops/sec
      1024 workfun operations performed at   68266666 ops/sec
      1024 no-ops performed at              256000000 ops/sec
      2048 workfun operations performed at   60235294 ops/sec
      2048 no-ops performed at              292571428 ops/sec
      4096 workfun operations performed at   52512820 ops/sec
      4096 no-ops performed at              273066666 ops/sec
      8192 workfun operations performed at   66064516 ops/sec
      8192 no-ops performed at              282482758 ops/sec
     16384 workfun operations performed at   59148014 ops/sec
     16384 no-ops performed at              163840000 ops/sec
Suddenly a re-compilation with additional optimization occurs:
     32768 workfun operations performed at  910222222 ops/sec
     32768 no-ops performed at              910222222 ops/sec
     65536 workfun operations performed at  923042253 ops/sec
     65536 no-ops performed at              923042253 ops/sec
    131072 workfun operations performed at  929588652 ops/sec
    131072 no-ops performed at              929588652 ops/sec
    262144 workfun operations performed at  929588652 ops/sec
    262144 no-ops performed at              929588652 ops/sec
    524288 workfun operations performed at  931239786 ops/sec
    524288 no-ops performed at              931239786 ops/sec



Simon

Apr 23, 2014, 6:00:43 PM4/23/14
to v8-u...@googlegroups.com
Thanks for the info, but hmmm... I'm a bit confused now. In the first example you sent, 'addon sin sum' hardly changes at all after recompilation and 'homes in' on 5.9M ops/sec. In the second example there's a massive jump for both after recompilation. Why the difference in behaviour? Under which circumstances can addons benefit from the recompilation? Thanks, Simon


mog...@syntheticsemantics.com

Apr 23, 2014, 6:37:16 PM4/23/14
to v8-u...@googlegroups.com
Simon,

One difference is that the second set (replicating the experiment in https://kkaefer.com/node-cpp-modules/#benchmark-thread-pool) uses a synthetic workload the compiler can get rid of entirely, so the benchmark isn't timing any work; it's the same as executing a no-op.  A second difference is that the timings include some combination of optimized and unoptimized execution, which isn't a number you can use to make performance predictions based on the number of iterations.

The problem is that the work function is invariant and its results are unused, so the compiler is free to hoist the loop body out or eliminate it entirely:
function() { return Math.floor(133.7 / Math.PI); }

My test loop calls sin(), which the compiler cannot analyze, so it must assume there are side effects and call the function every iteration.  Additionally, the return values are summed, so the compiler can't reduce the loop to just its last iteration; all of them must execute:
    for(var i = 0;  i < nOps;  i++) {
        sum += Math.sin(i)
    }

Crankshaft performs many additional optimizations (dead code elimination, hoisting, native compilation, etc.), but v8 can't recompile the interface to a native addon, so all of the copy-in/out scaffolding remains; using Math.sin() allows that overhead to be optimized away.  For practical purposes, the overhead of copy-in/out is the only difference between the two sin() experiments.

Regardless of how it's compiled, as the trip counts increase the performance asymptotically approaches some maximum for the architecture.  If anything, native code gets in the way of Crankshaft optimizations, which is why the benefit is smaller for the native addon experiments than for JS code alone.

The overhead of copy-in/out is significant but unavoidable.  For EMS, the benefit of paying that overhead is access to all the cores, and that performance multiplier easily overcomes it.  FWIW, the additional overhead for handling strings is relatively small, so you shouldn't consider its use limited to scalar values.

           -J

Simon

Apr 23, 2014, 6:48:14 PM4/23/14
to v8-u...@googlegroups.com
Thanks for the detailed reply.

In my results, using string values can be significantly slower than using scalar values if the string is big enough: e.g. a C++ function returning an 8 byte string can be called 7.1M times per second, but change the return value to a 4KB string and that 7.1M drops to only 1.4M times per second. Whereas the same 4KB string returned using String::NewExternal() manages 5.4M calls per second. So you might consider adding String::NewExternal() to ems if it is not used already :-)

* estimate 7142827 calls per second; object: unwrapped, input: 3 ints, output: 8 byte str
* estimate 1428573 calls per second; object: unwrapped, input: 3 ints, output: 4KB str
* estimate 5405379 calls per second; object: unwrapped, input: 3 ints, output: 4KB str external

Also, in case you haven't already seen it, whitedb [1] reminds me a bit of ems. It seems somebody has attempted a Node port of whitedb too [2].

--
Simon
