Calling C++ from JavaScript: how to modify an input argument?

SimonHF

Apr 22, 2014, 12:54:16 PM4/22/14
to v8-u...@googlegroups.com
For example, I can get a uint like this in a C++ function: uint32_t myuint32 = args[0]->Int32Value();

But is it also possible to change the value somehow from C++ land, so that in JavaScript the variable passed into the function will reflect the changed value?

If this is possible with some C++ argument types and not others, then which types allow modification?

Thanks.

Andreas Rossberg

Apr 23, 2014, 5:28:24 AM4/23/14
to v8-u...@googlegroups.com
On 22 April 2014 18:54, SimonHF <sim...@gmail.com> wrote:
> For example, I can get a uint like this in a C++ function: uint32_t myuint32
> = args[0]->Int32Value();
>
> But is it also possible to change the value somehow from C++ land, so that
> in javascript the variable passed into the function will reflect the changed
> value?

You don't pass "variables" in JavaScript; you pass values.
Consequently, you cannot mutate arguments the way you suggest. (If
those values happen to be mutable objects, then you can of course
mutate those, but that has nothing to do with parameter semantics.)
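
For example (a minimal sketch assuming a 2014-era v8 API with FunctionCallbackInfo; the names "Bump" and "n" are illustrative), an addon can mutate a property of an object argument, and the caller will see the change, because both sides reference the same object value:

#include <v8.h>
using namespace v8;

void Bump(const FunctionCallbackInfo<Value>& args) {
  Isolate* isolate = args.GetIsolate();
  if (args[0]->IsObject()) {
    Local<Object> obj = args[0].As<Object>();
    // Read obj.n, increment it, and write it back in place.
    uint32_t n = obj->Get(String::NewFromUtf8(isolate, "n"))->Uint32Value();
    obj->Set(String::NewFromUtf8(isolate, "n"),
             Integer::NewFromUnsigned(isolate, n + 1));
  }
}

// In JS: var o = { n: 41 }; bump(o);  // o.n is now 42 -- but a plain
// number passed as args[0] could not be changed this way.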

/Andreas

mog...@syntheticsemantics.com

Apr 23, 2014, 1:03:00 PM4/23/14
to v8-u...@googlegroups.com
Simon,

The rationale behind Andreas's answer is that v8 implements a virtual machine, and by definition the only way to move data into or out of it is copy-in/copy-out through a v8 interface.  Using a native plug-in to defeat the isolation of a v8 isolate will only break design assumptions in v8.

An off-heap buffer can be allocated and accessed from inside v8, but referencing that memory from within a JS program requires buffer access methods (see the Node.js v0.10.26 Buffer documentation), which limits you to scalar types.  In practice these operations end up copying the data from the buffer onto the v8 heap anyway; true zero-copy in v8 is nearly impossible.
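
One way to picture this (a minimal sketch against the 2014-era v8 API; the function name is illustrative, and the block must outlive the ArrayBuffer because v8 does not own it):

#include <v8.h>
using namespace v8;

// Wrap an externally allocated block in an externalized ArrayBuffer.
// A JS program can then read it through a typed-array view, i.e. one
// scalar value at a time.
void GetBlock(const FunctionCallbackInfo<Value>& args) {
  static double block[512];  // off-heap storage, never freed
  Local<ArrayBuffer> ab =
      ArrayBuffer::New(args.GetIsolate(), block, sizeof(block));
  args.GetReturnValue().Set(ab);
}

// In JS: var v = new Float64Array(getBlock()); v[0] reads the C++ memory.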

I wrote a native Node addon (https://www.npmjs.org/package/ems) that combines synchronization primitives with shared memory.  It also depends on copy-in/out, and because it's a native plugin it deoptimizes code that uses it.  Nevertheless, it's still capable of millions of atomic updates per second, far better than is possible with messaging.

             -J

Simon

Apr 23, 2014, 1:55:02 PM4/23/14
to v8-u...@googlegroups.com
Thanks for the info and the link. Looks very interesting. I will definitely take a look at ems.

FYI here's what I have discovered so far: 

I created a native Node addon consisting of a function that does nothing. If JavaScript calls the vanilla function as quickly as possible, it manages about 3 million calls per second. I guess this is the high-water mark.

If I modify the function so that it returns a string (which has to be created, with the string bytes copied into the new string object), the calls per second drop substantially, depending upon the length of the returned string.

A way around this is the String::NewExternal() mechanism, which provides a way to make an immutable external string inside v8.
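
For reference, that mechanism looks roughly like this (a minimal sketch assuming the 2014-era v8 API, where the resource type is String::ExternalAsciiStringResource; the bytes must stay alive and unchanged for as long as v8 holds the string):

#include <v8.h>
using namespace v8;

// v8 references our bytes in place instead of copying them onto its heap.
class StaticResource : public String::ExternalAsciiStringResource {
 public:
  StaticResource(const char* data, size_t length)
      : data_(data), length_(length) {}
  const char* data() const { return data_; }
  size_t length() const { return length_; }
 private:
  const char* data_;
  size_t length_;
};

void GetPayload(const FunctionCallbackInfo<Value>& args) {
  static char payload[4096];  // filled elsewhere, never freed
  args.GetReturnValue().Set(String::NewExternal(
      args.GetIsolate(), new StaticResource(payload, sizeof(payload))));
}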

So far I have not managed to get Buffer to give the same kind of performance as String::NewExternal(). Performance seems to be about a third as good :-( Still experimenting.

I'm also on the lookout for mutable objects, as Andreas suggested...

Thanks,
Simon



SimonHF

Apr 23, 2014, 3:53:55 PM4/23/14
to v8-u...@googlegroups.com
FYI here are some perf results that I got calling different types of dummy C++ functions:

* estimate 25000322 calls per second; object: n/a, input: n/a, output: n/a
* estimate 20000019 calls per second; object: unwrapped, input: n/a, output: n/a
* estimate 13333240 calls per second; object: unwrapped, input: 3 ints, output: n/a
* estimate 10000010 calls per second; object: unwrapped, input: 3 ints, output: int
* estimate 7142827 calls per second; object: unwrapped, input: 3 ints, output: 8 byte str
* estimate 1428573 calls per second; object: unwrapped, input: 3 ints, output: 4KB str
* estimate 5405379 calls per second; object: unwrapped, input: 3 ints, output: 4KB str external
* estimate 338983 calls per second; object: unwrapped, input: 3 ints, output: 4KB buffer
* estimate 555556 calls per second; object: unwrapped, input: 3 ints, output: 4KB buffer external

So a dummy C++ function with no input, no output, and no object unwrapping can be called about 25M times per second on my laptop. The same function, once it unwraps its object, can only be called 20M times per second. Add 3 input parameters and it can only be called 13.3M times per second... etc.

Then comes the interesting bit (for me anyway): if the function returns a 4KB string, the calls per second drop to 1.4M. However, using the String::NewExternal() method results in a much better -- as expected -- count of 5.4M per second. The disappointing figures are with node's Buffer::New(): only 339K calls per second for the non-zero-copy method, and only 555K calls per second for the zero-copy version; about 10x slower than the String::NewExternal() method.
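
For reference, the timed functions have roughly this shape (a hypothetical sketch; MyObject and the exact API names are illustrative, assuming node's ObjectWrap and a 2014-era v8 API):

#include <node.h>
#include <v8.h>

class MyObject : public node::ObjectWrap {};

void Dummy(const v8::FunctionCallbackInfo<v8::Value>& args) {
  // "unwrapped": recover the C++ object behind the JS receiver
  MyObject* self = node::ObjectWrap::Unwrap<MyObject>(args.Holder());
  (void)self;
  // "input: 3 ints": unbox three JS numbers
  int a = args[0]->Int32Value();
  int b = args[1]->Int32Value();
  int c = args[2]->Int32Value();
  // "output: int": box a result back into a JS value
  args.GetReturnValue().Set(v8::Integer::New(args.GetIsolate(), a + b + c));
}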

Why is Buffer::New() so slow...?

mog...@syntheticsemantics.com

Apr 23, 2014, 4:18:17 PM4/23/14
to v8-u...@googlegroups.com
Simon,

A month ago I ran similar experiments and got results on the order of what you measured.  Two notes about this type of synthetic benchmark:

1. Use a high-resolution timer (e.g.: npm install microtime).  These results have suspiciously round rates, of the kind you get by dividing 1000000 operations by a small integer.

2. Try a set of experiments that sweeps through a range of iteration counts (e.g.: powers of two from 1 to 1M).  After some number of iterations (about 32k in my experiments) your code is recompiled by Crankshaft, which has completely different execution characteristics for both JS and native addons (native calls are treated as deoptimizations by Crankshaft).  Your results mix the output of the two compilers.

          -J




Simon

Apr 23, 2014, 5:22:14 PM4/23/14
to v8-u...@googlegroups.com
Thanks for the timing tips. In these tests I'm only interested in ball-park figures for the big picture, but if I need a more accurate timer I'll definitely keep microtime in mind. I am very interested in exploring what you said about Crankshaft. Do you have some example code showing this effect? Thanks, Simon

mog...@syntheticsemantics.com

Apr 23, 2014, 5:43:39 PM4/23/14
to v8-u...@googlegroups.com
I compared the performance of calling a non-inlinable library function (sin from libm), which can be reached from JS either as Math.sin() or through my native addon, which calls libm's sin() directly.  I assume Node is using the same math library, so I'm really measuring the difference between external native calls that can be optimized by Crankshaft and external native calls that require copy-in/out of arguments and results.  The JS jig looked like this:

var nOps = 1
var totalOps = 100000000
var microtime = require('microtime');

function rightJustify(strArg, nChars) {
    // left-pad strArg, then keep the last nChars characters
    var str = '                    ' + strArg
    return str.substr(str.length - nChars)
}


var sum = 0
// workfun is the function under test; this placeholder makes the jig run
// stand-alone (the real runs summed Math.sin() or the addon's sin()):
function workfun() { sum += Math.sin(sum) }

while(nOps <= totalOps) {
    var startTime = microtime.now()
    for(var i = 0;  i < nOps;  i++) {
        workfun()
    }
    var opsPerSec = (nOps * 1000000) / (microtime.now()-startTime)
    console.log(rightJustify(nOps,10) + " workfun operations performed at " + 
                rightJustify(Math.floor(opsPerSec), 10) + " ops/sec")   
    nOps *= 2
}
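
The addon side is, roughly, a thin wrapper around libm (a hypothetical sketch of the addon's function, assuming the 2014-era FunctionCallbackInfo API; as discussed below, the cost difference against Math.sin() is in the boxing/unboxing scaffolding, not in sin() itself):

#include <cmath>
#include <v8.h>
using namespace v8;

void AddonSin(const FunctionCallbackInfo<Value>& args) {
  // copy-in: unbox the JS number; call libm; copy-out: box the result
  args.GetReturnValue().Set(
      Number::New(args.GetIsolate(), sin(args[0]->NumberValue())));
}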


Results looked like this:
Math.sin sum:           1 operations performed at        827 ops/sec
Addon sin sum:          1 operations performed at       9803 ops/sec
Math.sin sum:           2 operations performed at     200000 ops/sec
Addon sin sum:          2 operations performed at     105263 ops/sec
Math.sin sum:           4 operations performed at    4000000 ops/sec
Addon sin sum:          4 operations performed at    1333333 ops/sec
Math.sin sum:           8 operations performed at    8000000 ops/sec
Addon sin sum:          8 operations performed at    4000000 ops/sec
Math.sin sum:          16 operations performed at   16000000 ops/sec
Addon sin sum:         16 operations performed at    4000000 ops/sec
Math.sin sum:          32 operations performed at    3555555 ops/sec
Addon sin sum:         32 operations performed at    4571428 ops/sec
Math.sin sum:          64 operations performed at   21333333 ops/sec
Addon sin sum:         64 operations performed at    3764705 ops/sec
Math.sin sum:         128 operations performed at   14222222 ops/sec
Addon sin sum:        128 operations performed at     907801 ops/sec
Math.sin sum:         256 operations performed at   17066666 ops/sec
Addon sin sum:        256 operations performed at     733524 ops/sec
Math.sin sum:         512 operations performed at   15515151 ops/sec
Addon sin sum:        512 operations performed at    5019607 ops/sec
Math.sin sum:        1024 operations performed at   12190476 ops/sec
Addon sin sum:       1024 operations performed at    5389473 ops/sec
Math.sin sum:        2048 operations performed at   13562913 ops/sec
Addon sin sum:       2048 operations performed at    5251282 ops/sec
Math.sin sum:        4096 operations performed at   13791245 ops/sec
Addon sin sum:       4096 operations performed at    3230283 ops/sec
Math.sin sum:        8192 operations performed at   12226865 ops/sec
Addon sin sum:       8192 operations performed at    4571428 ops/sec
Math.sin sum:       16384 operations performed at   12064801 ops/sec
Addon sin sum:      16384 operations performed at    4571428 ops/sec
Math.sin sum:       32768 operations performed at   17645665 ops/sec
Addon sin sum:      32768 operations performed at    5759887 ops/sec
The re-compilation occurs, and the actual overhead of calling C from JS becomes apparent:
Math.sin sum:       65536 operations performed at   22028907 ops/sec
Addon sin sum:      65536 operations performed at    5907869 ops/sec
Math.sin sum:      131072 operations performed at   21962466 ops/sec
Addon sin sum:     131072 operations performed at    5938114 ops/sec
Math.sin sum:      262144 operations performed at   21907404 ops/sec
Addon sin sum:     262144 operations performed at    5915869 ops/sec
Math.sin sum:      524288 operations performed at   22024280 ops/sec
Addon sin sum:     524288 operations performed at    5932336 ops/sec


        -J

mog...@syntheticsemantics.com

Apr 23, 2014, 5:52:12 PM4/23/14
to v8-u...@googlegroups.com
I should point out this experiment came about when I was trying to replicate the results at https://kkaefer.com/node-cpp-modules/#benchmark-thread-pool (discussed further below).

In his case, the entire work function is optimized away by Crankshaft in a very obvious way.  The experiment compares his loop body to a no-op.

              -J


Work Function: Math.floor(133.7 / Math.PI)

         1 workfun operations performed at       7092 ops/sec
         1 no-ops performed at                 333333 ops/sec
         2 workfun operations performed at     142857 ops/sec
         2 no-ops performed at               Infinity ops/sec
         4 workfun operations performed at   Infinity ops/sec
         4 no-ops performed at               Infinity ops/sec
         8 workfun operations performed at    8000000 ops/sec
         8 no-ops performed at               Infinity ops/sec
        16 workfun operations performed at   16000000 ops/sec
        16 no-ops performed at               Infinity ops/sec
        32 workfun operations performed at   16000000 ops/sec
        32 no-ops performed at               Infinity ops/sec
        64 workfun operations performed at    9142857 ops/sec
        64 no-ops performed at               Infinity ops/sec
       128 workfun operations performed at     405063 ops/sec
       128 no-ops performed at               Infinity ops/sec
       256 workfun operations performed at    1855072 ops/sec
       256 no-ops performed at              256000000 ops/sec
       512 workfun operations performed at   64000000 ops/sec
       512 no-ops performed at              256000000 ops/sec
      1024 workfun operations performed at   68266666 ops/sec
      1024 no-ops performed at              256000000 ops/sec
      2048 workfun operations performed at   60235294 ops/sec
      2048 no-ops performed at              292571428 ops/sec
      4096 workfun operations performed at   52512820 ops/sec
      4096 no-ops performed at              273066666 ops/sec
      8192 workfun operations performed at   66064516 ops/sec
      8192 no-ops performed at              282482758 ops/sec
     16384 workfun operations performed at   59148014 ops/sec
     16384 no-ops performed at              163840000 ops/sec
Suddenly a re-compilation with additional optimization occurs:
     32768 workfun operations performed at  910222222 ops/sec
     32768 no-ops performed at              910222222 ops/sec
     65536 workfun operations performed at  923042253 ops/sec
     65536 no-ops performed at              923042253 ops/sec
    131072 workfun operations performed at  929588652 ops/sec
    131072 no-ops performed at              929588652 ops/sec
    262144 workfun operations performed at  929588652 ops/sec
    262144 no-ops performed at              929588652 ops/sec
    524288 workfun operations performed at  931239786 ops/sec
    524288 no-ops performed at              931239786 ops/sec



Simon

Apr 23, 2014, 6:00:43 PM4/23/14
to v8-u...@googlegroups.com
Thanks for the info, but hmmm... I'm a bit confused now. In the first example you sent, 'addon sin sum' hardly changes at all after recompilation and 'homes in' on 5.9M ops/sec. In the second example there's a massive jump for both after recompilation. Why the difference in behaviour? Under which circumstances can addons benefit from the recompilation? Thanks, Simon


mog...@syntheticsemantics.com

Apr 23, 2014, 6:37:16 PM4/23/14
to v8-u...@googlegroups.com
Simon,

One difference is that the second set (replicating the experiment in https://kkaefer.com/node-cpp-modules/#benchmark-thread-pool) uses a synthetic workload the compiler can get rid of entirely, so the benchmark isn't timing any work; it's the same as executing a no-op.  A second difference is that the timings include some combination of optimized and unoptimized execution, which isn't a number you can use to make performance predictions based on the number of iterations.

The problem is that the work function is invariant and its results are unused, so the compiler is free to hoist the loop body out or eliminate it entirely:
function() { return Math.floor(133.7 / Math.PI); }

My test loop calls sin(), which the compiler cannot analyze, so it must assume there are side effects and call the function every iteration.  Additionally, the return values are summed, so the compiler can't reduce the loop to just its last iteration; all of them must execute:
    for(var i = 0;  i < nOps;  i++) {
        sum += Math.sin(i)
    }

Crankshaft performs many additional optimizations (dead code elimination, hoisting, native compilation, etc.), but v8 can't recompile the interface to a native addon, so all of the copy-in/out scaffolding remains; using Math.sin() allows that overhead to be optimized away.  For practical purposes, the overhead of copy-in/out is the only difference between the two sin() experiments.

Regardless of how it's compiled, as the trip counts increase the performance asymptotically approaches some maximum for the architecture.  If anything, native code gets in the way of Crankshaft optimizations, which is why the benefit is smaller for the native addon experiments than for JS code alone.

The overhead of copy-in/out is significant but unavoidable.  For EMS, the benefit of paying that overhead is access to all the cores, and that performance multiplier easily overcomes it.  FWIW, the additional overhead for handling strings is relatively small, so you shouldn't consider its use limited to scalar values.

           -J

Simon

Apr 23, 2014, 6:48:14 PM4/23/14
to v8-u...@googlegroups.com
Thanks for the detailed reply.

In my results, using string values can be significantly slower than using scalar values if the string is big enough: e.g. a C++ function returning an 8 byte string can be called 7.1M times per second, but change the return value to a 4KB string and that 7.1M drops to only 1.4M times per second. Whereas the same 4KB string returned using String::NewExternal() manages 5.4M calls per second. So you might consider adding String::NewExternal() to ems if it is not used already :-)

* estimate 7142827 calls per second; object: unwrapped, input: 3 ints, output: 8 byte str
* estimate 1428573 calls per second; object: unwrapped, input: 3 ints, output: 4KB str
* estimate 5405379 calls per second; object: unwrapped, input: 3 ints, output: 4KB str external

Also, in case you haven't already seen it, whitedb [1] reminds me a bit of ems. It seems somebody has attempted a Node port of whitedb too [2].

--
Simon
