Non-Blocking Readback from GPU

Evgeny Demidov

May 15, 2019, 4:28:11 AM

to WebGL Dev List

I came across this by accident at https://www.khronos.org/assets/uploads/developers/library/2018-siggraph/03-WebGL-BOF-Intro_Aug18.pdf . I'm using a "dummy" readback to measure GEMM timings after gl.dispatchCompute().

1. I hope the readback is blocking in Canary (with no fences).

2. Are there any examples that use fences? E.g. something like

var fence = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
setTimeout(() => {
  // The timeout argument is in nanoseconds (1000000 ns = 1 ms).
  gl.clientWaitSync(fence, gl.SYNC_FLUSH_COMMANDS_BIT, 1000000);
  // gl.getBufferSubData(gl.TRANSFORM_FEEDBACK_BUFFER, 0, dataOut);
}, 0);

Evgeny

Jeff Gilbert

May 15, 2019, 12:46:53 PM

to webgl-d...@googlegroups.com

Efficient non-blocking readback/download of data from the GPU is a
little tricky:
https://jdashg.github.io/misc/async-gpu-downloads.html
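
One way to keep the readback from stalling JS is to insert a fence after the GPU work, poll it with a zero timeout (for example once per frame), and call getBufferSubData only after the fence has signaled. The following is a minimal sketch of that idea, not the code from the linked page. `startReadback` and `tryFinishReadback` are hypothetical helper names, and `gl` is assumed to be a WebGL2 (or webgl2-compute) context.

```javascript
// Hypothetical helper: call right after issuing the GPU work that fills `buffer`.
function startReadback(gl, target, buffer) {
  const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
  gl.flush(); // make sure the fence is actually submitted to the GPU
  return { sync, target, buffer };
}

// Hypothetical helper: call periodically (e.g. from requestAnimationFrame).
// Returns true once `dst` has been filled, false while the GPU is still busy.
function tryFinishReadback(gl, pending, dst) {
  // Timeout 0 means this call only checks the fence and never blocks.
  const status = gl.clientWaitSync(pending.sync, 0, 0);
  if (status === gl.WAIT_FAILED) {
    gl.deleteSync(pending.sync);
    throw new Error("clientWaitSync failed");
  }
  if (status === gl.TIMEOUT_EXPIRED) {
    return false; // not done yet, try again later
  }
  // ALREADY_SIGNALED or CONDITION_SATISFIED: the data is ready to copy.
  gl.deleteSync(pending.sync);
  gl.bindBuffer(pending.target, pending.buffer);
  gl.getBufferSubData(pending.target, 0, dst);
  return true;
}
```

A caller would keep the object returned by `startReadback` and call `tryFinishReadback` from a timer or animation-frame callback until it returns true.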

> --
> You received this message because you are subscribed to the Google Groups "WebGL Dev List" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to webgl-dev-lis...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/webgl-dev-list/29c71491-1226-4dde-a8ca-bee34d63b4f6%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Kai Ninomiya

May 16, 2019, 3:55:54 PM

to WebGL Dev List

If you want to measure raw GPU performance, consider using EXT_disjoint_timer_query.
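
For reference, a timer query around a dispatch might look like the sketch below. It assumes a WebGL2-based context that exposes the EXT_disjoint_timer_query_webgl2 variant of the extension (an assumption; on WebGL1 the EXT_disjoint_timer_query extension puts EXT-suffixed query methods on the extension object instead). `beginGpuTimer`, `endGpuTimer`, and `readGpuTimerMs` are hypothetical helper names.

```javascript
// Assumed setup (may return null where the extension is not exposed):
//   const ext = gl.getExtension("EXT_disjoint_timer_query_webgl2");

// Hypothetical helper: start measuring GPU time for the commands that follow.
function beginGpuTimer(gl, ext) {
  const query = gl.createQuery();
  gl.beginQuery(ext.TIME_ELAPSED_EXT, query);
  return query;
}

// Hypothetical helper: stop measuring; call after the commands to be timed.
function endGpuTimer(gl, ext) {
  gl.endQuery(ext.TIME_ELAPSED_EXT);
}

// Hypothetical helper: poll from a later task or frame.
// Returns null while the result is pending, NaN if the GPU reported a
// disjoint event (the measurement is unreliable), else elapsed milliseconds.
function readGpuTimerMs(gl, ext, query) {
  if (!gl.getQueryParameter(query, gl.QUERY_RESULT_AVAILABLE)) {
    return null;
  }
  const disjoint = gl.getParameter(ext.GPU_DISJOINT_EXT);
  const elapsedNs = gl.getQueryParameter(query, gl.QUERY_RESULT); // nanoseconds
  gl.deleteQuery(query);
  return disjoint ? NaN : elapsedNs / 1e6;
}
```

The result is not expected to be available in the same task that issued the query, so `readGpuTimerMs` is meant to be called again later until it stops returning null.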

Non-blocking readbacks won't help for measuring GPU performance. Their purpose is to reduce blocking time in JS, but the latency until the data returns is higher with non-blocking readback. So I would expect your timing to become less precise and measure more overhead.

Kai Ninomiya

May 16, 2019, 3:57:21 PM

to WebGL Dev List

Actually the latency could be either higher or lower. But either way I think it will introduce more variance, as the timing becomes dependent on the browser scheduler.

Jeff Gilbert

May 16, 2019, 6:57:02 PM

to webgl-d...@googlegroups.com

Oh, yeah, if you're just timing things, timer queries are what you want, where exposed. (Definitely not everywhere.)

You can also get OK benchmarking data by sending a ton of work (hundreds of milliseconds' worth) and then checking either the return value of clientWaitSync or the completion status of the fence sync via getSyncParameter.
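
A rough sketch of that fence-based benchmark follows. `submitTimed` and `pollTimed` are hypothetical helper names, `submitWork` stands for whatever issues the draw or dispatch calls, and `now` is a clock such as `() => performance.now()`. The sync status is only expected to change between tasks, so the poll has to run from a timer or animation-frame callback, and the result includes that polling granularity (hence the advice to submit hundreds of milliseconds of work).

```javascript
// Hypothetical helper: submit the work, then place a fence behind it.
function submitTimed(gl, submitWork, now) {
  const t0 = now();
  submitWork(); // e.g. many gl.dispatchCompute() or draw calls
  const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
  gl.flush(); // push the work and the fence to the GPU
  return { sync, t0 };
}

// Hypothetical helper: poll from setTimeout / requestAnimationFrame.
// Returns null while the GPU is still busy, else elapsed wall-clock time
// in the units of `now` (milliseconds for performance.now()).
function pollTimed(gl, timed, now) {
  if (gl.getSyncParameter(timed.sync, gl.SYNC_STATUS) !== gl.SIGNALED) {
    return null;
  }
  gl.deleteSync(timed.sync);
  return now() - timed.t0;
}
```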

Evgeny Demidov

May 16, 2019, 11:43:34 PM

to webgl-d...@googlegroups.com

Time is a very interesting thing. On a low-end AMD A6-5200 APU (I have never seen results from a high-end one yet :) I get, for Shader 7 (compute shader + SSBO at https://www.ibiblio.org/e-notes/webgl/gpu/mul/sgemm.htm):

N = 128 time = 29.3ms GFLOPS=0.07

N = 256 time = 29.3ms GFLOPS=0.57

N = 512 time = 32.5ms GFLOPS=4.13 (~N^3)

...

N = 2048 time = 219.5ms GFLOPS=39.13

N = 4096 time = 1534.3ms GFLOPS=44.79

(Note that for matrix multiplication the FLOP count scales as N^3, where N is the size of the square matrix.)
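
The GFLOPS figures in this post are reproduced by counting N^3 operations per multiplication (one per multiply-add; many benchmarks count 2N^3 instead). A small sketch of that arithmetic, with `gflops` as a hypothetical helper name:

```javascript
// GFLOPS for an N x N matrix multiplication, counting N^3 operations.
// `timeMs` is the measured time in milliseconds.
function gflops(n, timeMs) {
  const flop = n ** 3;
  const seconds = timeMs * 1e-3;
  return flop / seconds / 1e9;
}

console.log(gflops(512, 32.5).toFixed(2));    // "4.13"
console.log(gflops(4096, 1534.3).toFixed(2)); // "44.79"
```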

So for GEMM the "pure performance" is ~45 GFLOPS, but the measured time includes ~30 ms of overhead, so we only get high performance for T ~ 1 sec. Evidently, for PoseNet (pose, face... recognition) one needs 60 FPS and T ~ 20 ms.
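
The effect of a fixed overhead can be sketched with a simple model: total time = overhead + N^3 / peak rate. This is only an illustrative assumption using the ~30 ms and ~45 GFLOPS estimates above, not a measured fit, and `modelGemm` is a hypothetical helper name.

```javascript
// Simple model: a fixed per-call overhead plus compute time at the peak rate.
// Counts N^3 operations, matching the GFLOPS numbers in this post.
function modelGemm(n, peakGflops, overheadMs) {
  const flop = n ** 3;
  const computeMs = (flop / (peakGflops * 1e9)) * 1e3;
  const totalMs = overheadMs + computeMs;
  const effectiveGflops = flop / (totalMs * 1e-3) / 1e9;
  return { computeMs, totalMs, effectiveGflops };
}

// With 30 ms overhead and a 45 GFLOPS peak, small sizes are dominated by the
// overhead, while N = 2048 lands near the measured 219.5 ms.
for (const n of [128, 512, 2048]) {
  const m = modelGemm(n, 45, 30);
  console.log(n, m.totalMs.toFixed(1) + " ms", m.effectiveGflops.toFixed(2) + " GFLOPS");
}
```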

But in real applications the CPU and GPU work together asynchronously, which may hide the (30 ms) CPU overhead. We need to build e.g. PoseNet to see the results :)

Even more interesting: for the TFjs GEMM (WebGL + textures) I just get

N = 16 time = 2.8ms GFLOPS=0

N = 64 time = 2.8ms GFLOPS=0.09

Why is the overhead in WebGL (rendering a square) 10 times smaller than in webgl2-compute (dispatch)?

Evgeny Demidov

May 17, 2019, 12:07:09 AM

to webgl-d...@googlegroups.com

Sorry, that was with the D3D11 backend. With the OpenGL backend + webgl2-compute I get

N = 128 time = 1.9ms GFLOPS=1.11

N = 256 time = 2.4ms GFLOPS=7.03

So the overhead is ~2 ms, the same as for WebGL.
