
Overhead of returning a string from C++ to JS over WebIDL bindings


Henri Sivonen

Jun 16, 2017, 7:23:23 AM
to dev-platform
I noticed a huge performance difference between
https://hsivonen.com/test/moz/encoding_bench_web/ and
https://github.com/hsivonen/encoding_bench/ . The former has the
overhead of JS bindings. The latter doesn't.

On a 2009 Mac Mini (Core 2 Duo), in the case of English, the overhead
is over twice the time spent by encoding_rs in isolation, so the time
per iteration more than triples!

The snowman test indicates that this isn't caused by SpiderMonkey's
Latin1 space optimization.

Safari performs better than Firefox, despite Safari using ICU (except
for UTF-8 and windows-1252) and ICU being slower than encoding_rs in
isolation on encoding_bench (tested on Linux). Also, looking at
Safari's UTF-8 and windows-1252 decoders (which I haven't had the time
to extract for isolated testing) and at Safari's TextDecoder
implementation, there's no magic there (especially no magic compared
to the Blink fork of the UTF-8 and windows-1252 decoders).

My hypothesis is that the JSC/WebKit overhead of returning a string
from C++ to JS is much lower than SpiderMonkey/Gecko overhead or the
V8/Blink overhead. (When encoding from string to ArrayBuffer, Safari
doesn't have the advantage, which is also suggestive of this not being
a matter of how GC happens relative to the timed runs.)

Do we know why?

--
Henri Sivonen
hsiv...@hsivonen.fi
https://hsivonen.fi/

Jan de Mooij

Jun 16, 2017, 8:08:51 AM
to Henri Sivonen, dev-platform
I profiled this quickly and we're spending a lot of time in GC.
Nursery-allocating strings (bug 903519) is going to help a lot here, as
it will make both string allocation and GC much faster. Objects are
already nursery-allocated, so that's probably why ArrayBuffer is
faster.

We create external strings (a JS string referencing a DOM string buffer)
for the strings returned from DOM to JS and in my profiles I don't see us
spending a lot of time on this. More than 70% of the time is in
encoding_rs. It may be different for other parts of the benchmark - it
would be nice to have a minimal testcase showing the problem.
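
To make the "external string" idea concrete, here is a rough
illustrative sketch with hypothetical types (not the real nsStringBuffer
or JSString classes): the JS string keeps a reference to the DOM-owned
character buffer instead of copying it, and releases that reference only
when the GC finalizes the string.

// Illustrative sketch only, with hypothetical types (not the real
// nsStringBuffer or JSString): an external JS string shares the
// DOM-owned character buffer instead of copying the characters.

#include <atomic>
#include <cstddef>
#include <cstdint>

struct DomStringBuffer {              // stand-in for a refcounted DOM string buffer
  std::atomic<uint32_t> refCount{1};
  const char16_t* chars;
  size_t length;

  void AddRef() { refCount.fetch_add(1, std::memory_order_relaxed); }
  void Release() {
    if (refCount.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      delete[] chars;                 // last reference: the characters go away
      delete this;
    }
  }
};

struct ExternalJsString {             // stand-in for a GC-managed external JSString
  DomStringBuffer* buffer;            // shared, not copied

  explicit ExternalJsString(DomStringBuffer* buf) : buffer(buf) {
    buffer->AddRef();                 // the JS string keeps the DOM buffer alive
  }

  // Runs when the GC finalizes the string, i.e. at some later GC,
  // not at the moment the last JS reference goes away.
  void Finalize() { buffer->Release(); }
};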

Hope this helps,
Jan

Jan de Mooij

Jun 16, 2017, 8:41:28 AM
to Henri Sivonen, dev-platform
On Fri, Jun 16, 2017 at 2:08 PM, Jan de Mooij <jdem...@mozilla.com> wrote:

> Objects are already nursery allocated, so that's probably why ArrayBuffer
> is faster.
>

Sorry, I think ArrayBuffers are not nursery allocated right now, so
allocating a ton of them would also trigger major GCs.

Also note that we have an external string cache (see
ExternalStringCache::lookup); it compares either the char pointers or
the actual characters (if length <= 100). If we are returning the same
strings (same characters) all the time in this benchmark, you could try
removing that length check to see if it makes a difference.

Jan

Boris Zbarsky

Jun 16, 2017, 4:19:56 PM
On 6/16/17 7:22 AM, Henri Sivonen wrote:
> My hypothesis is that the JSC/WebKit overhead of returning a string
> from C++ to JS is much lower than SpiderMonkey/Gecko overhead or the
> V8/Blink overhead.

It definitely is. JSC and WebKit use the same exact refcounted strings,
last I checked, so returning a string from WebKit to JSC involves a
single non-atomic refcount increment. It's super-fast.
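
To illustrate why that is so cheap, here is a minimal sketch with
hypothetical names (not WebKit's actual classes): when the DOM side and
the JS engine share one refcounted string type, handing a string to JS
is just one plain, non-atomic refcount increment, with no copy and no
GC allocation.

// Minimal sketch with hypothetical names (not WTF::StringImpl itself):
// one refcounted string type shared by DOM and JS.

#include <cstddef>
#include <cstdint>

struct SharedStringImpl {
  uint32_t refCount = 1;        // non-atomic: the string stays on one thread
  const char16_t* chars = nullptr;
  size_t length = 0;

  void ref() { ++refCount; }
  void deref() {
    if (--refCount == 0) {
      delete[] chars;
      delete this;
    }
  }
};

// "Binding layer" returning a DOM-produced string to the JS engine.
SharedStringImpl* ReturnStringToJs(SharedStringImpl* domString) {
  domString->ref();             // the only per-call cost at the boundary
  return domString;             // the JS value wraps the very same buffer
}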

-Boris

Henri Sivonen

Jun 22, 2017, 4:44:06 AM
to dev-platform
On Fri, Jun 16, 2017 at 3:08 PM, Jan de Mooij <jdem...@mozilla.com> wrote:
> I profiled this quickly and we're spending a lot of time in GC.

OK. So I accidentally created a string GC test instead of creating a
TextDecoder test. :-(

Is there a good cross-browser way to cause GC predictably outside the
timed benchmark section in order to count only the TextDecoder run in
the timing?

On Fri, Jun 16, 2017 at 3:41 PM, Jan de Mooij <jdem...@mozilla.com> wrote:
> Also note that we have an external string cache (see
> ExternalStringCache::lookup), it compares either the char pointers or the
> actual characters (if length <= 100). If we are returning the same strings
> (same characters) all the time on this benchmark, you could try removing
> that length check to see if it makes a difference.

The length of the string is always well over 100, so that already
means that a string cache isn't interfering with the test, right?
(Interfering meaning making the test reflect something other than the
performance of the back end used by TextDecoder.)

Jan said "We create external strings (a JS string referencing a DOM
string buffer) for the strings returned from DOM to JS", so that means
Gecko does roughly the same in this case, right? Could it be that JSC
realizes that nothing holds onto the string and derefs it right away
without having to wait for an actual GC?

Henri Sivonen

Jun 22, 2017, 6:20:00 AM
to Jan de Mooij, dev-platform
On Fri, Jun 16, 2017 at 3:08 PM, Jan de Mooij <jdem...@mozilla.com> wrote:
> It may be different for other parts of the benchmark - it would be nice to
> have a minimal testcase showing the problem.

https://hsivonen.com/test/moz/encoding_bench_web/english-only.html is
minimized in the sense that 1) it runs only one benchmark and 2) it
does the setup first and then waits for the user to click a button,
which hopefully makes it easier to omit the setup from examination.

Boris Zbarsky

Jun 22, 2017, 11:14:53 AM
On 6/22/17 4:43 AM, Henri Sivonen wrote:
> The length of the string is always well over 100, so that already
> means that a string cache isn't interfering with the test, right?

The way the string cache works is that it will reuse an existing
JSString* in two situations:

1) The nsStringBuffer* is the same exact pointer as the cached thing.
2) The nsStringBuffer* is different, but length < 100 and the chars
match the cached thing.

In your case, situation 2 is not applying; presumably situation 1 is
not either, so you create a new JSString. I think Jan is suggesting
that you could make situation 2 apply by raising that "100" limit.
Whether the resulting compare is faster than allocating a JSString and
GCing it later is not obvious, of course.
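
Spelled out as code, the lookup rule above looks roughly like this
(hypothetical stand-in types, not the actual ExternalStringCache
implementation; the thread quotes the limit as both <= 100 and < 100,
which makes no difference for strings this long):

// Rough paraphrase of the two situations above, with stand-in types
// rather than the real nsStringBuffer/JSString/ExternalStringCache.

#include <cstddef>
#include <cstring>

struct DomBuffer {                    // stand-in for an nsStringBuffer
  const char16_t* chars;
  size_t length;
};
struct JsString {};                   // stand-in for a GC-managed JSString

constexpr size_t kMaxCompareLength = 100;  // the limit Jan suggests raising

struct CacheEntry {
  const DomBuffer* buffer;
  JsString* string;
};

JsString* Lookup(const CacheEntry& cached, const DomBuffer* incoming) {
  // Situation 1: the incoming buffer is the exact pointer we cached.
  if (incoming == cached.buffer) {
    return cached.string;
  }
  // Situation 2: a different buffer, but short enough that comparing
  // the characters is considered cheaper than creating (and later
  // GCing) a new JSString.
  if (incoming->length <= kMaxCompareLength &&
      incoming->length == cached.buffer->length &&
      std::memcmp(incoming->chars, cached.buffer->chars,
                  incoming->length * sizeof(char16_t)) == 0) {
    return cached.string;
  }
  return nullptr;  // cache miss: the caller makes a new external JSString
}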

> Jan said "We create external strings (a JS string referencing a DOM
> string buffer) for the strings returned from DOM to JS", so that means
> Gecko does roughly the same in this case, right?

With a bunch more overhead (atomic refcounts, function calls, etc), but yes.

> Could it be that JSC
> realizes that nothing holds onto the string and derefs it right away
> without having to wait for an actual GC?

Possible. I haven't looked at their code in a while.

-Boris


Boris Zbarsky

Jun 22, 2017, 1:32:55 PM
On 6/22/17 6:19 AM, Henri Sivonen wrote:
> https://hsivonen.com/test/moz/encoding_bench_web/english-only.html

OK, so here's what I'm seeing on that benchmark, all numbers measured on
Mac with a current nightly; other platforms may differ, etc. Profile at
https://perf-html.io/public/f6a0a4a61edcd784b461d17ea3879c30e03ee7fb/calltree/?implementation=cpp&thread=2
for someone who wants to look themselves, but the numbers below are from
an Instruments profile, because there I can do a much saner job of
coalescing the various codepaths that lead to the functions of
interest... [1]

Out of a total time of 7.6s or so on the main thread, 5.4s is under
TextDecoderBinding::decode. 2.1s is under gcIfRequested. The remaining
time, what there is of it, is almost all under various JS execution bits.

The time under gcIfRequested is almost all under sweeping/finalization.
In fact, it's almost all under arena deallocation. I see 1.7s under
huge_dalloc/chunk_dealloc (on Mac), with most of that (1.4s) being the
madvise calls we make there and another 0.3s calling munmap. Depending
on what it is we're freeing (the GC arenas themselves or the string
data), either nursery allocation for external strings[2] or background
finalization of external strings[3] would help here. The stacks look
like we're freeing gc arenas, but it seems like there would be more
freeing of the string data...

The time under TextDecoderBinding::decode is almost all (5.34s out of
5.4s) under encoding_rs::Decoder::decode_to_utf16 and in particular in
encoding_rs::variant::VariantDecoder::decode_to_utf16_raw. That's all
self time in there. I do see numbers that are much slower than
Safari's here, for what it's worth (about 2x), and the 25% of our time
spent under GC is not nearly enough to account for the difference.

-Boris

[1] Not least because of
https://github.com/devtools-html/perf.html/issues/388
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1375565
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=627220
