Emscripten WebGL VS native OpenGL benchmark

Jean-Marc Le Roux

Aug 18, 2014, 11:11:34 AM

to emscripte...@googlegroups.com

Hi there!

As part of the next beta of Minko, our cross-platform, free and open source 3D engine, we're working hard on some major performance improvements.
I took some time to write a prototype 3D engine to measure OpenGL "raw performance". My goal was to:
- see how to get the same results with an actual high level 3D engine;
- see how WebGL performs compared to OpenGL thanks to Emscripten.

TL;DR try it for yourself (results in the dev console):

http://minko.io/wp-content/uploads/2014/08/minko-example-cube.html

more details on the forum:

http://minko.io/forums/topic/performance-improvement/

Now the asm.js version is a lot slower than the native one: more than 10 times slower.
It doesn't really fit with Emscripten's usual figures that are between 2 and 5 times slower than native.

Is there anything we're doing wrong?
Is WebGL the bottleneck?

Anyway, feedback appreciated :)

Floh

Aug 18, 2014, 12:30:47 PM

to emscripte...@googlegroups.com

The findings from my own benchmarking are basically that the problem is not JS performance, but WebGL call overhead, thus it is extremely important to reduce the number of calls into WebGL.

Most surprising to me is that a WebGL application on Windows can easily beat a native desktop OSX application, because the OSX OpenGL driver sucks so badly (at least on my MBP with an Intel HD 4000). But staying on Windows, a native desktop application can easily have 10x more draw call throughput than a WebGL app running on the same machine, not because of slow JS performance, but because of WebGL overhead.

I have 3 test scenarios, all numbers are roughly "number of instances drawn per frame until frame rate drops below 60fps":

- naive drawing with unique draw calls (1 uniform update, 1 draw call, no other state changes in between), object positions are computed on the CPU. Best case here is about 70k draws on Windows native OpenGL with NVIDIA; on OSX it starts to drop below 60fps at around 12k instances, and with WebGL it's between 5k and 6k draws (browser, platform or CPU doesn't matter in this case)

- next I tested drawing with ANGLE_instanced_arrays: object positions are computed on the CPU, written to a (double-buffered) dynamic vertex buffer, and then rendered with a single draw call. In Chrome on Windows with NVIDIA I can get 450k instances before performance drops below 60fps (so 450k particle position updates per frame in JS, and no sweat!). Performance in a native app isn't better here; my suspicion is that the vertex buffer update is the limiter (500k instances means 8MByte of dynamic vertex data shuffled to the GPU each frame). On my OSX MBP I can go up to about 180k instances (again very likely vertex throughput limited). The way the dynamic vertex buffer is updated also matters in this case: it looks like vertex buffer orphaning is useless in WebGL (see discussion here: https://groups.google.com/forum/#!topic/webgl-dev-list/vMNXSNRAg8M), so I switched to double-buffering

- finally I tried to do everything on the GPU with 2 passes: first evaluate particle positions in a fullscreen-quad fragment shader, then use vertex shader texture fetch to place the particles, also using instanced rendering. This goes up to about 800k instances on my Windows/NVIDIA machine in Chrome, but doesn't improve on my OSX machine (I guess the problem there is rendering to an RGBA32F render target, which might overload an Intel HD4000 a bit).
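The 8MByte figure in the instancing case follows from simple arithmetic if each instance carries four 32-bit floats (16 bytes). That per-instance layout is an assumption made here for illustration, it is not stated in the post, and the names below are hypothetical. A minimal sketch of the size calculation and of packing CPU-computed positions into one contiguous array:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed per-instance layout: one vec4 (x, y, z, w) of 32-bit floats.
// The thread does not state the layout; 16 bytes per instance is what
// makes 500k instances come out at roughly 8 MByte.
struct InstanceData {
    float x, y, z, w;
};
static_assert(sizeof(InstanceData) == 16, "expected a tightly packed vec4");

// Bytes that must be uploaded each frame for `instanceCount` instances.
std::size_t instanceBufferBytes(std::size_t instanceCount) {
    return instanceCount * sizeof(InstanceData);
}

// Packs CPU-computed positions into one contiguous array, so that a
// single buffer upload and a single instanced draw call can replace
// one uniform update plus one draw call per object.
std::vector<InstanceData> packInstances(const std::vector<float>& xs,
                                        const std::vector<float>& ys,
                                        const std::vector<float>& zs) {
    assert(xs.size() == ys.size() && ys.size() == zs.size());
    std::vector<InstanceData> packed;
    packed.reserve(xs.size());
    for (std::size_t i = 0; i < xs.size(); ++i) {
        packed.push_back(InstanceData{xs[i], ys[i], zs[i], 1.0f});
    }
    return packed;
}
```

Under the same assumption, the 450k instances measured in Chrome would be about 7.2 MB per frame, or roughly 430 MB/s at 60fps.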

Also, the PNaCl versions of the demos have about the same limitations as the WebGL version, pointing to the browser's GL wrapper as the bottleneck.
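To put the naive draw call numbers in perspective, they can be converted into a rough time budget per draw call. This is a back-of-the-envelope sketch that assumes the whole 60fps frame is spent issuing draw calls, which is a simplification; the function name is hypothetical.

```cpp
#include <cassert>

// Time available per draw call, in microseconds, if the whole 60 fps
// frame (1/60 s) were spent issuing `drawsPerFrame` draw calls.
// Simplification: ignores GPU time and everything else in the frame.
double perDrawBudgetMicros(double drawsPerFrame) {
    assert(drawsPerFrame > 0.0);
    const double frameMicros = 1000000.0 / 60.0;  // about 16667 us
    return frameMicros / drawsPerFrame;
}
```

With the numbers quoted above this gives roughly 0.24 microseconds per draw for native OpenGL on Windows/NVIDIA (70k draws), about 1.4 microseconds on OSX (12k), and about 3 microseconds for WebGL (taking 5.5k as the midpoint of the 5k to 6k range), which is consistent with the "10x" gap mentioned earlier.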

Here are the demo links:

- naive drawing with unique draw calls: http://floooh.github.io/oryol/DrawCallPerf.html

- hardware instanced rendering, CPU position updates: http://floooh.github.io/oryol/Instancing.html

- fully GPU rendered: http://floooh.github.io/oryol/GPUParticles.html

So in conclusion:

- JS performance is perfectly fine, both in Chrome and FF (even IE11 and the latest Safari)

- WebGL is the bottleneck, try to minimize calling into WebGL as much as possible

- OSX OpenGL sucks ass, especially with an Intel GPU

There's also a very handy benchmark table here from the bgfx engine which nicely shows what draw call performance to expect on various platforms:

https://github.com/bkaradzic/bgfx#17-drawstress

Cheers,

-Floh.

Alon Zakai

Aug 18, 2014, 6:09:48 PM

to emscripte...@googlegroups.com

Running this in Firefox's JS profiler, there seem to be a lot of invoke_* calls, which are calls out of asm.js into a try-catch, then back in. They are used for setjmp and for C++ exceptions - does your code use one of those? If so, then it can explain the slowdown, as the overhead in emscripten for setjmp and C++ exceptions is much, much higher than in native builds.
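For readers unfamiliar with the invoke_* wrappers, here is a conceptual model of the "out into a try-catch, then back in" round trip described above. It is written in C++ purely for illustration: the real wrappers are generated JavaScript, and `InvokeState` and `invokeModel` are hypothetical names, not Emscripten APIs.

```cpp
#include <cassert>
#include <functional>

// Conceptual C++ model only: the real invoke_* wrappers are generated
// JavaScript, and the names below are hypothetical.
struct InvokeState {
    bool threw = false;   // set when the wrapped call raised an exception
    int invokeCount = 0;  // how many wrapped calls were made
};

// Calls `fn` from inside a try/catch, the way an invoke_* wrapper leaves
// compiled code, enters a try block, and calls back in.
void invokeModel(InvokeState& state, const std::function<void()>& fn) {
    assert(fn);  // an empty std::function cannot be called
    ++state.invokeCount;
    try {
        fn();
    } catch (...) {
        // The caller checks this flag after the call returns and then
        // runs its own cleanup or handler code.
        state.threw = true;
    }
}
```

The point of the model is that the wrapper cost is paid on every call, whether or not anything is thrown.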

- Alon

--
You received this message because you are subscribed to the Google Groups "emscripten-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to emscripten-disc...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jean-Marc Le Roux

Aug 19, 2014, 1:37:27 AM

to emscripte...@googlegroups.com

Thanks for the feedback Alon!

Does this mean that:
a) the code is throwing then catching some actual exceptions at every frame
OR b) the code has a try/catch at every frame but no actual exceptions, but it has to call invoke_* because of the try/catch block?

When I run my native code I don't see any exception.

When I follow those invoke_* calls, I either find:
- calls to webgl itself (_glUniformMatrix4fv, _glDrawElements, _glUniform4fv)
- calls to asm.js code

How could we proceed to understand where those invoke_* come from?

Thanks,

Jean-Marc Le Roux

Aug 19, 2014, 2:36:52 AM

to emscripte...@googlegroups.com

So I tried removing a try/catch block and it definitely removed one of those 2 heavy invoke_* stacks.
The one that finally calls to webgl is still here.

I've tried commenting out my OpenGL calls. The obvious result is that the corresponding calls to WebGL disappear.
But the (huge) invoke_* stack that was on top of it remains (except without the WebGL call at the bottom of it of course).
So that's definitely not about how I call OpenGL or the WebGL bindings.

Now what I've tried is to run the native app and add a breakpoint in the function that was calling OpenGL.
That should be the "bottom" of that massive invoke_* stack.
What I've found is that I have a single try/catch block very high in the call stack, in the method that executes our "complete" signal for asset loading.

So as far as I understand: a try/catch causes every function call made inside it to be wrapped in an invoke_* method to handle potential exceptions.
Whether an exception is actually thrown does not matter.
Correct?

Jukka Jylänki

Aug 19, 2014, 4:54:40 AM

to emscripte...@googlegroups.com

Do you perhaps call into GL via custom, manually obtained function pointers (i.e. eglGetProcAddress or similar)? If so, that will amount exactly to an invoke_xx() call, since it causes asm.js code to call non-asm.js code via a function pointer. In Emscripten, all GL functions are available by static linking at compilation time, even all functions for all WebGL extensions. For best performance, the xxxGetProcAddress route should be avoided.

Jean-Marc Le Roux

Aug 19, 2014, 4:58:10 AM

to emscripte...@googlegroups.com

I use the gl* functions directly.

Could you give more details about the function pointer thing? We rely on a signal class that collects callbacks as std::function objects.
I'm guessing std::function stores function pointers internally (among other things).
Does it mean that dereferencing and calling a function pointer implies an invoke_* call?


--
Jean-Marc Le Roux

Founder and CEO of Aerys (http://aerys.in)

Blog: http://blogs.aerys.in/jeanmarc-leroux
Cell: (+33)6 20 56 45 78

Phone: (+33)9 72 40 17 58

Mark Callow

Aug 19, 2014, 3:52:47 PM

to emscripte...@googlegroups.com

On 2014/08/18 9:30, Floh wrote:

> So in conclusion:

> - JS performance is perfectly fine, both in Chrome and FF (even IE11 and the latest Safari)

> - WebGL is the bottleneck, try to minimize calling into WebGL as much as possible

> - OSX OpenGL sucks ass, especially with an Intel GPU

It sounds like you are modifying the vertices for each draw call. If that is the case, then to determine WebGL vs. native performance you should compare WebGL drawing with drawing using buffer objects in native OpenGL. It is not clear you did that in all the cases you cited. Comparing client-side array draws with WebGL draws (which use buffer objects) will incorrectly assign "overhead" to WebGL.

Regards

-Mark


Alon Zakai

Aug 19, 2014, 7:15:02 PM

to emscripte...@googlegroups.com

Yes, that's basically correct. A try-catch means that we need to be able to catch exceptions inside it, so

    try {
        func1();
        func2();
    } catch (type t) { ... }

will result in two invokes, to be able to capture exceptions on each of those calls.

Furthermore, a try-catch is necessary if an exception may pass through your call frame,

    void func() {
        SomeRAIIType raii;
        ...
        func1();
        callThatMayThrow();
        func2();
        ...
    }

If the call that might throw throws, we still need to call raii's destructor. The semantics are as if you wrote in pseudocode

    void func() {
        SomeRAIIType raii;
        try {
            ...
            func1();
            callThatMayThrow();
            func2();
            ...
        } finally {
            raii.~SomeRAIIType();
        }
    }

And the bad thing is that the new() operator can throw an exception! So this can happen quite often, and we end up doing an invoke for the func1 and func2 calls.

We disable exceptions by default in -O1 and above for these reasons, and since the invoke overhead is quite high for us in the current architecture.
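The RAII case can be observed directly in ordinary C++. The following is a small self-contained sketch, reusing the placeholder names from the pseudocode above plus a hypothetical `eventLog` and `runDemo`, that shows the destructor running while the exception passes through `func()` even though `func()` contains no try/catch of its own:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Records the order of events so the unwinding behaviour is observable.
std::vector<std::string> eventLog;

struct SomeRAIIType {
    SomeRAIIType() { eventLog.push_back("ctor"); }
    ~SomeRAIIType() { eventLog.push_back("dtor"); }
};

void func1() { eventLog.push_back("func1"); }
void func2() { eventLog.push_back("func2"); }
void callThatMayThrow() { throw std::runtime_error("boom"); }

// No try/catch is written here, yet the compiler must still emit cleanup
// code so that ~SomeRAIIType() runs when the exception passes through.
void func() {
    SomeRAIIType raii;
    func1();
    callThatMayThrow();
    func2();  // never reached
}

// Runs func() and returns the recorded event order.
std::vector<std::string> runDemo() {
    eventLog.clear();
    try {
        func();
    } catch (const std::runtime_error&) {
        eventLog.push_back("caught");
    }
    return eventLog;
}
```

Calling `runDemo()` records the constructor, `func1`, the destructor, and only then the handler; `func2` never appears. That implicit cleanup is what forces the calls inside `func()` to be made in a way that can intercept an exception.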

- Alon

Jean-Marc Le Roux

Aug 20, 2014, 3:42:32 AM

to emscripte...@googlegroups.com

Hi,

we had "-s DISABLE_EXCEPTION_CATCHING=0" even in the release config.
Removing that flag made all the invoke_* wrappers disappear as expected.

Thanks for your help everyone anyway! I'll post the new binary ASAP to get more results.

If you have other inputs on how to make this even faster please let me know!

Floh

Aug 20, 2014, 9:33:15 AM

to emscripte...@googlegroups.com

Yes of course, everything is in buffer objects, there are no client-side arrays at all in the engine. The buffers for the dynamic vertex data are double-buffered to avoid locks (however in the native case there still seems to be a lock where OpenGL has to wait, so maybe I'll even try triple buffering next).
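As an illustration of the double (or triple) buffering idea, here is a minimal sketch of round-robin slot selection. It only tracks which slot to write each frame; in a real renderer each slot would hold a GL buffer object, and `BufferRing` is a hypothetical name rather than anything from the engine discussed here.

```cpp
#include <cassert>
#include <cstddef>

// Round-robin selection over `count` buffers (2 = double buffering,
// 3 = triple buffering). In a real renderer each slot would hold a GL
// buffer object name; here only the slot index is tracked.
class BufferRing {
public:
    explicit BufferRing(std::size_t count) : count_(count) {
        assert(count_ > 0);
    }

    // Slot to write this frame. The slot written last frame is left
    // untouched, so the GPU can still read from it.
    std::size_t current() const { return current_; }

    // Call once per frame after issuing the draw that uses current().
    void advance() { current_ = (current_ + 1) % count_; }

private:
    std::size_t count_;
    std::size_t current_ = 0;
};
```

Switching from double to triple buffering is then just a matter of constructing the ring with 3 instead of 2, at the cost of one more buffer's worth of memory.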

Also, the demos which use minimal draw calls have basically identical performance between WebGL and native (e.g. the instancing demo only has one draw call for the scene, and the GPU particle demo has 2 draw calls, plus one more for the debug text output).

Another thing that should be avoided in WebGL at all cost (also in PNaCl) is to upload dynamic indices, since these must be checked for out-of-bounds indices, and the code behind this is surprisingly complex (Firefox builds a binary tree structure from the indices to speed up future range checks, for instance).

Cheers,

-Floh.

Jean-Marc Le Roux

Aug 20, 2014, 9:37:58 AM

to emscripte...@googlegroups.com

> Another thing that should be avoided in WebGL at all cost (also in PNaCl) is to upload dynamic indices, since these must be checked for out-of-bounds indices, and the code behind this is surprisingly complex (Firefox builds a binary tree structure from the indices to speed up future range checks, for instance).

By dynamic indices, you mean:
a) creating the index buffer with GL_DYNAMIC_DRAW
or b) calling glDrawElements() with an actual int* for the "indices" argument?


Floh

Aug 20, 2014, 9:56:18 AM

to emscripte...@googlegroups.com

Both I guess, but mainly (a):

(a) if indices are dynamically written every frame, then the index buffer (or a range of it) is marked as dirty and must be re-validated (I guess this range check happens later at draw-call time, and the actual details differ heavily between Firefox and Chrome as far as I know), but the important thing is that every time indices are updated a fairly expensive operation needs to take place

(b) I guess this (putting indices not into buffer objects but into raw client-side memory) means that the WebGL implementation has to assume that the index data is different for each call, but I am not sure of the implementation details

The take-away is that dynamic index updates are much more expensive than on desktop GL and should be avoided at all cost (I usually pre-allocate a big index buffer and fill it with as many triangle indices as I am rendering at maximum, and then render a part of it, depending on how many primitives I want to render).
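The pre-allocation approach could look roughly like the sketch below. It assumes quads built from four consecutive vertices and two triangles each, which is an assumption for illustration (the post only says "triangle indices"), and the function names are hypothetical. The idea is to build the index data once, upload it once, and afterwards only vary the `count` argument of glDrawElements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds indices for `maxQuads` quads, two triangles each, assuming four
// consecutive vertices per quad. The quad layout is an assumption; the
// post only says "triangle indices".
std::vector<std::uint16_t> buildQuadIndices(std::size_t maxQuads) {
    // 16-bit indices can address at most 65536 vertices, i.e. 16384 quads.
    assert(maxQuads <= 16384);
    std::vector<std::uint16_t> indices;
    indices.reserve(maxQuads * 6);
    for (std::size_t q = 0; q < maxQuads; ++q) {
        const std::uint16_t base = static_cast<std::uint16_t>(q * 4);
        const std::uint16_t pattern[6] = {0, 1, 2, 0, 2, 3};
        for (std::uint16_t offset : pattern) {
            indices.push_back(static_cast<std::uint16_t>(base + offset));
        }
    }
    return indices;
}

// Number of indices to pass as the `count` argument of glDrawElements
// when only `quadCount` of the pre-built quads should be drawn.
std::size_t indexCountFor(std::size_t quadCount) {
    return quadCount * 6;
}
```

Because the index data never changes after the initial upload, the browser should only need to validate it once; only the vertex data is rewritten per frame.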

-Floh.


Jean-Marc Le Roux

Aug 20, 2014, 10:00:10 AM

to emscripte...@googlegroups.com

It makes sense.

Are we sure that the current Emscripten OpenGL to WebGL bindings do not imply this kind of behavior despite the fact it's not used in the original OpenGL code?

There might be some kind of restriction (or faulty implementation) that would cause all index buffers to be considered dirty/dynamic.
How can we check?


Jukka Jylänki

Aug 20, 2014, 12:54:12 PM

to emscripte...@googlegroups.com

2014-08-20 16:37 GMT+03:00 Jean-Marc Le Roux <jeanmar...@aerys.in>:

> By dynamic indices, you mean:
> a) creating the index buffer with GL_DYNAMIC_DRAW
> or b) calling glDrawElements() with an actual int* for the "indices" argument?

a) I don't think the use of GL_{STATIC/STREAM/DYNAMIC}_DRAW changes how browsers deal with the data, the code paths should directly pass these hints to the GPU driver.

b) WebGL does not support calling glDrawElements with a CPU-side int* buffer as an argument. In Emscripten this is not supported either by default, unless you enable the -s FULL_ES2=1 link flag which adds emulated support for clientside arrays.

The performance hit that Floh is referring to with the binary tree occurs whenever you use glBuffer(Sub)Data to update an index buffer, potentially lazily at next draw.

Mark Callow

Aug 20, 2014, 2:15:29 PM

to emscripte...@googlegroups.com

On 2014/08/20 6:37, Jean-Marc Le Roux wrote:

> By dynamic indices, you mean:
> a) creating the index buffer with GL_DYNAMIC_DRAW

GL_DYNAMIC_DRAW is just a hint. I'm not sure if any WebGL implementation pays attention to it in its OpenGL ES wrapper. What causes revalidation is changing the data in the buffer with bufferData or bufferSubData. This will happen regardless of the usage hint.
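To make the revalidation cost concrete, here is a deliberately simplified model of the mechanism discussed in this thread. It is not any browser's actual code (Firefox is described above as using a tree structure, for example) and the class name is hypothetical. It caches the largest index and only rescans the data after an update, regardless of any usage hint.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified model of index validation, NOT any browser's actual code.
class ValidatedIndexBuffer {
public:
    // Models bufferData/bufferSubData: any write invalidates the cache.
    void update(const std::vector<std::uint16_t>& indices) {
        indices_ = indices;
        cacheValid_ = false;
    }

    // Models the check at draw time: every index must refer to an
    // existing vertex. The O(n) scan is repeated only after an update.
    bool validateDraw(std::size_t vertexCount) {
        if (!cacheValid_) {
            cachedMax_ = indices_.empty()
                ? 0
                : *std::max_element(indices_.begin(), indices_.end());
            cacheValid_ = true;
            ++scanCount_;
        }
        return indices_.empty() || cachedMax_ < vertexCount;
    }

    // How many full scans of the index data have been performed.
    std::size_t scanCount() const { return scanCount_; }

private:
    std::vector<std::uint16_t> indices_;
    std::size_t cachedMax_ = 0;
    bool cacheValid_ = false;
    std::size_t scanCount_ = 0;
};
```

In this model a static index buffer is scanned once no matter how many times it is drawn, while a buffer rewritten every frame is rescanned every frame.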
