Matrix multiplication in Compute shaders


Evgeny Demidov

unread,
Feb 20, 2019, 2:00:53 PM2/20/19
to WebGL Dev List
Matrix multiplication at https://www.ibiblio.org/e-notes/webgl/gpu/mul/sgemm.htm
Cedric Nugteren's results on a Kepler GPU are very impressive (up to 10x acceleration). I get only 20-25% acceleration from shared data.
Is this a consequence of my old HW, or of the WebGL2-compute implementation? ;)
Where can we find more testers on modern HW? The Shader 1,2 benchmark.


As I can't use f16 (a potential x2 acceleration), I'll try Shader 3 next (without much hope of success :)

Evgeny

Ken Russell

unread,
Feb 20, 2019, 5:05:34 PM2/20/19
to WebGL Dev List, Qin, Jiajia
Nice work Evgeny and something we should definitely investigate.

Jiajia, any chance you or a teammate can look into this?

Also, per the other thread, https://bugs.chromium.org/p/angleproject/issues/detail?id=3160 has been filed to make half-floats work in compute shaders.

-Ken


--
You received this message because you are subscribed to the Google Groups "WebGL Dev List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to webgl-dev-lis...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Evgeny Demidov

unread,
Feb 21, 2019, 2:46:06 AM2/21/19
to WebGL Dev List
There are NxNxN FLOPs (mul + add) in matrix multiplication, while the data are only ~NxN elements. Therefore very high GFLOPS are possible. We can exclude data preparation from the timing, and that is enough to demonstrate the advantage of compute shaders.
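The arithmetic-intensity argument can be sketched in a few lines of JS (illustrative helper names, not from the benchmark code):

```javascript
// Arithmetic intensity of NxN matrix multiplication: ~N^3 multiply-adds
// over only ~N^2 stored elements, so compute grows much faster than data.
function matmulIntensity(n) {
  const fma = n * n * n;       // one multiply-add per (row, col, k) triple
  const elems = 3 * n * n;     // elements of A, B and the result C
  return { fma, elems, ratio: fma / elems };
}

// For N = 1024: ~1.07e9 multiply-adds over ~3.1e6 elements (ratio N/3).
```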
For a library like TensorFlow.js, full end-to-end performance is important. Matrix-vector multiplication and convolution performance is limited by bandwidth (only ~NxN FLOPs). Therefore:
1. Can we use zero-copy for SSBOs on embedded GPUs with shared CPU+GPU memory (Intel, mobile)?

2. Is it possible to make (temporarily) an ANGLEFloat16Array, a small WASM (native) function to convert JS floats <-> Uint16Array, or something else?
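Until half-floats land in ANGLE, the f32 <-> f16 bit conversion mentioned in point 2 can be done in plain JS; a minimal sketch (it truncates the mantissa instead of rounding, unlike a full library such as petamoriken/float16; the function name is made up for illustration):

```javascript
// Sketch of a JS float -> IEEE 754 binary16 conversion, storable in a
// Uint16Array. Truncates the mantissa (no round-to-nearest).
function float32ToHalfBits(val) {
  const view = new DataView(new ArrayBuffer(4));
  view.setFloat32(0, val);                      // first narrow to float32
  const x = view.getUint32(0);
  const sign = (x >>> 16) & 0x8000;
  let exp = ((x >>> 23) & 0xff) - 127 + 15;     // rebias 8-bit -> 5-bit exponent
  let mant = x & 0x7fffff;
  if (((x >>> 23) & 0xff) === 0xff) return sign | 0x7c00 | (mant ? 1 : 0); // Inf/NaN
  if (exp >= 0x1f) return sign | 0x7c00;        // overflow -> Infinity
  if (exp <= 0) {                               // subnormal or zero
    if (exp < -10) return sign;                 // underflows to signed zero
    mant = (mant | 0x800000) >> (1 - exp);      // add implicit bit, shift down
    return sign | (mant >> 13);
  }
  return sign | (exp << 10) | (mant >> 13);     // normal number
}

// float32ToHalfBits(1.0) === 0x3C00, float32ToHalfBits(-2.0) === 0xC000
```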

Evgeny

Evgeny Demidov

unread,
Feb 21, 2019, 2:15:30 PM2/21/19
to WebGL Dev List
As feared, I get no more than 50% acceleration for "Shader 3: More work per thread", instead of 500%. I'm hesitating between:
1. Compiling OpenCL kernels 1, 2, 3 for Windows 10 and testing them
2. Making and compiling OpenGL shaders 1, 2, 3 for Windows 10 and testing them
3. Reading and thinking hard...
I've been doing 3 for several days (not ready to install VS 17 again :). Could someone compile Cedric Nugteren's demos and put them on the Web? Really, comparison results would be enough.

Evgeny

Ken Russell

unread,
Feb 21, 2019, 2:44:07 PM2/21/19
to WebGL Dev List

-Ken

Evgeny Demidov

unread,
Feb 22, 2019, 12:59:09 AM2/22/19
to WebGL Dev List
A quick answer (I'll try to test timings later).
1. A question for the Intel team: does ANGLE or the driver copy buffers (in the same shared CPU+GPU memory), or just pass a pointer to the buffer (zero-copy)? Somewhere in
gl.bufferData(gl.ARRAY_BUFFER, data, gl.STATIC_DRAW);
2. Does "petamoriken/float16" use WASM, SIMD, or native functions, or is it pure JS?
I think TensorFlow.js may have a lot of f16 buffers. Is it possible to accelerate "+, -, x const, convolution" operations on such buffers (not necessarily in WebGL, sorry :)? E.g. should one generate data in JS and then transfer it into a Uint16Array, or work immediately on buffers?

Fairly vague questions (and maybe a bit off the WebGL topic).

Evgeny




Evgeny Demidov

unread,
Feb 22, 2019, 1:40:48 AM2/22/19
to WebGL Dev List
Sorry, I see now that 1 and 2 are not very important if all operations are performed on the GPU. For input/output, "petamoriken/float16" is enough. The only thing I'd like is a more meaningful example than just multiplication.

Evgeny

Ken Russell

unread,
Feb 22, 2019, 2:17:09 PM2/22/19
to WebGL Dev List
On Thu, Feb 21, 2019 at 9:59 PM Evgeny Demidov <demidov...@gmail.com> wrote:
a quick answer (I'll try to test timings later)
1. A question for the Intel team: does ANGLE or the driver copy buffers (in the same shared CPU+GPU memory), or just pass a pointer to the buffer (zero-copy)? Somewhere in
gl.bufferData(gl.ARRAY_BUFFER, data, gl.STATIC_DRAW);

I can answer this - gl.bufferData *always* makes a copy of the CPU-side data. Whether the driver makes more copies in addition to that one is something the Intel team would have to answer. In Chrome, at least two copies are made - one into the shared memory with the GPU process, and one from that shared memory into ANGLE and/or the graphics driver.
 
2. Does "petamoriken/float16" use WASM, SIMD, or native functions, or is it pure JS?

It's pure JS. Take a look.
 
I think TensorFlow.js may have a lot of f16 buffers. Is it possible to accelerate "+, -, x const, convolution" operations on such buffers (not necessarily in WebGL, sorry :)? E.g. should one generate data in JS and then transfer it into a Uint16Array, or work immediately on buffers?

Not sure about this question.

-Ken


 


Qin, Jiajia

unread,
Feb 25, 2019, 4:35:45 AM2/25/19
to webgl-d...@googlegroups.com

Sorry for the late response. Some updates are posted on https://bugs.chromium.org/p/angleproject/issues/detail?id=3160. Thanks.

 

Regards,

Jiajia

Kentaro Kawakatsu

unread,
Feb 28, 2019, 1:37:49 PM2/28/19
to WebGL Dev List
Hi Evgeny,
Nice effort. I'm interested.

I tried your work.
I got 0.000004 as the "error" in Shader 1, and 0.7 in Shader 2.
All elements of the source matrices are Math.random(), so each element of the multiplication result is expected to be approximately 0.25N; in the Shader 2 case, 0.25 * 256 = 64.
The "error" is the average difference between the CPU and GPU results per element, so 0.7 / 64 means about 1% error has occurred on the GPU.
That looks pretty big.
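The 0.25N estimate follows from E[x*y] = 1/4 for independent uniform values; a quick Monte Carlo sanity check (hypothetical helper names, not from the benchmark code):

```javascript
// Each element of C = A*B sums N products of independent U(0,1) values,
// so its expectation is N * E[x*y] = N * 0.25.
function expectedElement(n) {
  return 0.25 * n;
}

// Monte Carlo check of that expectation: average several simulated
// output elements built from random products.
function simulateMean(n, samples) {
  let total = 0;
  for (let s = 0; s < samples; s++) {
    let acc = 0;
    for (let k = 0; k < n; k++) acc += Math.random() * Math.random();
    total += acc;
  }
  return total / samples;
}
// expectedElement(256) === 64, and simulateMean(256, 2000) lands near 64.
```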

I think you should synchronize each invocation within a workgroup to handle shared memory correctly.
barrier(CLK_LOCAL_MEM_FENCE) in OpenCL and groupMemoryBarrier() in OpenGL are not equivalent.
You should use memoryBarrierShared() and barrier().

for (uint t=0u; t<numTiles; t++) {
  // Load one tile of A and B into local memory
  uint tiledRow = TS*t + row;
  uint tiledCol = TS*t + col;
  Asub[col][row] = A[tiledCol*M + globalRow];
  Bsub[col][row] = B[globalCol*K + tiledRow];

  // Synchronise to make sure the tile is loaded
  memoryBarrierShared();
  barrier();

  // Perform the computation for a single tile
  for (uint k=0u; k<TS; k++) {
    acc += Asub[k][row] * Bsub[col][k];
  }

  // Synchronise before loading the next tile
  barrier();
}

(Note: memoryBarrierShared() may not be needed.)

I tried the above code, and then I got 0.00001 as "error". Please take a look.
Unfortunately, this change may have more impact on your benchmark results...

And ALL_BARRIER_BITS is not defined in WebGL2ComputeRenderingContext yet. (Is there any plan to implement it?)
SHADER_STORAGE_BARRIER_BIT is appropriate I think.

-Kentaro

Evgeny Demidov

unread,
Feb 28, 2019, 2:53:01 PM2/28/19
to WebGL Dev List
Hi Kentaro,

Thank you. I'll try it (I never got an error of more than 0.000011).
I've got "better" results with shared memory on a GeForce GT 710 and added matrix multiplication with textures in WebGL2. The conclusion is not clear yet. I think we need a more meaningful application.
https://www.ibiblio.org/e-notes/webgl/gpu/mul/sgemm.htm

By the way, one can use R16F textures in WebGL2. I'll try it in a while.

Evgeny

Kentaro Kawakatsu

unread,
Mar 1, 2019, 12:21:57 AM3/1/19
to WebGL Dev List
I never got an error of more than 0.000011
 
I got a 1.4 error in Shader 2 and 0.7 in Shader 3 with a GTX 780M (Kepler, same as the GT 710),
then a 0.7 error in Shader 2 and 0.05 in Shader 3 with a GTX 1080 Ti (Pascal).
I checked both under the OpenGL backend.
Sorry, I don't have any low-end environment.

Is the error related to the GPU grade...?

-Kentaro

Qin, Jiajia

unread,
Mar 1, 2019, 5:09:44 AM3/1/19
to webgl-d...@googlegroups.com

I can also reproduce this issue on Intel Kaby Lake. Agreed that barrier(CLK_LOCAL_MEM_FENCE) is equivalent to barrier(), not groupMemoryBarrier().

 

>And ALL_BARRIER_BITS is not defined in WebGL2ComputeRenderingContext yet. (Is there any plan to implement it?)

PR has been sent out. Thanks.

 

Regards,

Jiajia

 


Evgeny Demidov

unread,
Mar 1, 2019, 7:35:36 AM3/1/19
to WebGL Dev List
I've seen it too. I've corrected the shaders in accordance with Kentaro's recommendations and added a shared memory test. I'll repair the benchmarks in a while. But how will it be realized in the D3D11 backend? :)

Evgeny

Evgeny Demidov

unread,
Mar 4, 2019, 8:04:32 AM3/4/19
to WebGL Dev List
"Shader 4: Wider data-types" (tests and benchmarks) uses vec4 and mat2x4 data for acceleration. One can hope for a further x2 acceleration in "Shader 6" and an extra x2 for HALF_FLOAT data.
https://www.ibiblio.org/e-notes/webgl/gpu/mul/sgemm.htm

Evgeny

Hoda Naghibi

unread,
Mar 13, 2019, 1:29:41 PM3/13/19
to webgl-d...@googlegroups.com
Hi Evgeny,

Would you please share the whole JavaScript code for mul32 with textures in WebGL2? (I found just the compute shader part of the code in the repository.) Thank you so much.

Best,
Hoda


Evgeny Demidov

unread,
Mar 13, 2019, 4:06:15 PM3/13/19
to WebGL Dev List



Evgeny Demidov

unread,
Mar 27, 2019, 8:58:00 AM3/27/19
to WebGL Dev List
Matrix multiplication in compute shaders is 2 times faster than in TensorFlow.js (WebGL backend). I hope that is enough :) They say the CUDA-based TFJS backend is ~2 times faster than the WebGL backend too.
Obviously the latest TFJS algorithm is highly optimized (an old one was only as fast as my naive script).

Unfortunately, e.g. for N = 1024, GFLOPS oscillate from 3 to 4.6, so it is difficult to get accurate performance numbers. How can I make more accurate measurements, e.g. to plot a chart of the N dependence?

Remark: execution time includes ANGLE (translation into D3D11 and security checks), OpenGL/D3D11 driver time (all on the CPU) and GPU execution time. Therefore:
1. The results of the ANGLE checks are cached, so re-running a command is much faster.
2. If several commands are executed sequentially, the CPU and GPU can work in parallel, reducing the total computation time. So it is difficult to interpret the results of runs with several iterations.
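One common way to tame such oscillations is to report the median of many repeated runs, which discards the slow first run and outliers; a generic sketch (not tied to any particular timing API):

```javascript
// Median of repeated timings: robust against the slow first run
// (shader compile, ANGLE caching) and occasional outliers.
function median(times) {
  const s = [...times].sort((a, b) => a - b);
  const mid = s.length >> 1;
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// median([221.3, 196.6, 117.5, 145.3]) -> 170.95
```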

Evgeny

jacob bogers

unread,
Mar 27, 2019, 10:35:51 AM3/27/19
to webgl-d...@googlegroups.com
Hi, Evgeny

Not sure if I understand the numbers.
C = A*B where N = 2048 gives 449.4 ms.

So C is NxN, i.e. 4194304 elements;
each of these elements needs a multiplication of 2 vectors (a row of A and a column of B), so 2048*2048 = 4194304 multiplications and (2048-1) additions.

NxNx(NxN+N-1) operations for N = 2048 makes 17600771784704 operations,

all done in 0.4494 seconds, so this is 17600771784704/0.4494/1E9 = 39'165 GFLOPS.

Your browser says 30.83 GFLOPS.

Why am I seeing a difference, or did I make a mistake somewhere?

Cheers




jacob bogers

unread,
Mar 27, 2019, 10:41:09 AM3/27/19
to webgl-d...@googlegroups.com
Whoops, forget my previous email.

The number of operations (additions and multiplications) is NxNx(N+N-1).
So it comes out at 38.219 GFLOPS with your numbers, still a 0.8 GFLOPS difference, but I included the additions (inner product) in this case.
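The corrected count can be checked directly with the 449.4 ms / N = 2048 figures (a small sketch with an illustrative function name):

```javascript
// Classical matmul operation count: N*N output elements, each taking
// N multiplications and N-1 additions -> N*N*(2N - 1) operations.
function gflops(n, seconds) {
  const ops = n * n * (2 * n - 1);
  return ops / seconds / 1e9;
}

// gflops(2048, 0.4494) is ~38.2, matching the 38.219 figure above.
```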


Evgeny Demidov

unread,
Mar 27, 2019, 11:39:56 AM3/27/19
to WebGL Dev List
"the multiplications and additions can actually be fused into a single hardware instruction (FMA)"
https://cnugteren.github.io/tutorial/pages/page4.html
One can count an FMA as 2 FLOPs (as e.g. Intel suggests), but Cedric Nugteren has inspected the PTX assembly, e.g. at
https://cnugteren.github.io/tutorial/pages/page5.html
Therefore N*N*N/T is used as an approximation.
My problem is that unfortunately T oscillates too much!
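The two counting conventions differ only by the factor of two; a small sketch (illustrative names):

```javascript
// Two counting conventions: N^3 fused multiply-adds (one hardware FMA
// per multiply-add), or 2*N^3 FLOPs if each FMA counts as two.
function gfmaps(n, seconds) {
  return (n * n * n) / seconds / 1e9; // billions of FMAs/s, i.e. N*N*N/T
}
function gflopsFromFma(n, seconds) {
  return 2 * gfmaps(n, seconds);      // the Intel-style 2-FLOPs-per-FMA count
}
```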

Evgeny

jacob bogers

unread,
Mar 27, 2019, 11:45:25 AM3/27/19
to webgl-d...@googlegroups.com
Can you give a histogram/summary of several runs? Let's say 5 runs.


jacob bogers

unread,
Mar 27, 2019, 11:46:19 AM3/27/19
to webgl-d...@googlegroups.com
PS: for me the runs are stable to +-0.1%; it could be hardware/driver specifics of the platform you use, though.

Evgeny Demidov

unread,
Mar 27, 2019, 12:00:45 PM3/27/19
to WebGL Dev List
221.3 ms, 196.6 ms, 117.5 ms, 145.3 ms on an AMD APU (small enough).
As I remember, I get similar oscillations on the GT 710. I can test an Intel Atom Z3735F :-)
The first run is usually the slowest. What is your HW?

Evgeny

Evgeny Demidov

unread,
Mar 27, 2019, 12:03:03 PM3/27/19
to WebGL Dev List
I forgot: I get similar oscillations in all compute shaders too.

Evgeny Demidov

unread,
Apr 6, 2019, 1:11:49 PM4/6/19
to WebGL Dev List
Cedric Nugteren's "Shader 6: 2D register blocking" at
https://www.ibiblio.org/e-notes/webgl/gpu/mul/sgemm.htm
is ~5 times faster than TensorFlow.js matrix multiplication (WebGL based).
It is as fast as the SiSoftware Sandra OpenCL FP32 GPU test. Tested on a GeForce GT 710 (Windows 10, 64 bit). It would be great to:
1. Compare Intel Skylake tests with CLBlast data.
2. Optimize shaders for mobile HW.

As 2 is impossible yet, I think we can try fluid simulations next.

If the Intel team uses Cedric Nugteren's shaders, maybe they could put them on GitHub?
(My JS style is too specific :)

Evgeny

Ken Russell

unread,
Apr 8, 2019, 1:35:00 PM4/8/19
to WebGL Dev List
Fantastic work Evgeny! Leading edge stuff!



Evgeny Demidov

unread,
Apr 14, 2019, 6:10:21 AM4/14/19
to WebGL Dev List
"Compute shaders tuning"
Benchmarks for the AMD A6-5200 APU are similar to the Nvidia ones.
Unfortunately, the (fastest) shaders 6 and 7 are very slow on the Intel Atom Z3735F.
https://www.ibiblio.org/e-notes/webgl/gpu/mul/atom.htm
We haven't heard anything from the Intel team for a long time.

Evgeny

Evgeny Demidov

unread,
Apr 27, 2019, 9:50:27 AM4/27/19
to WebGL Dev List
Jiajia Qin (Intel) has found that Shader 6 needs optimization for modern Intel GPUs. It is fastest for tile size = 64x64 and work-per-thread = 4.
On my old Atom Z3735F it is still slow.

Evgeny

Kai Ninomiya

unread,
May 7, 2019, 2:12:40 PM5/7/19
to WebGL Dev List
Hey Evgeny,

I was trying to understand some mysterious performance differences between your measurements and our measurements in TF.js yesterday, and I think I identified an issue in the timing methodology with your TF.js benchmark:

It looks like (for some reason), the .dataSync() is very slow on TF.js (much slower than your getBufferSubData call in sgemm6b). It could be because they have to do an additional GPU-side copy in order to get the data back, I think (readPixels into PIXEL_PACK_BUFFER).

BTW, you had a comment about measuring the speed of the .dataSync() by reading back from matrix A. But since you just uploaded matrix A, the data is actually still stored on the CPU side so this readback isn't from the GPU. You can sort of test the readback speed by using a more trivial op, like tf.add(), and measuring the relative speed of that.

In our measurements, we've been amortizing the readback cost across many matrix multiplications, e.g. doing 50-200 chained matmuls and 1 readback. This isn't perfect, but gives a better idea of the cost without the readback.
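The amortization arithmetic itself is just a two-point fit of t(k) = k*perOp + readback; a sketch with made-up timings:

```javascript
// Amortize a fixed readback cost over chains of matmuls: with
// t(k) = k * perOp + readback, two chain lengths let us solve for
// the per-operation time without the readback cost.
function perOpMs(t1, k1, t2, k2) {
  return (t2 - t1) / (k2 - k1);
}

// Hypothetical timings: 1 matmul + readback = 60 ms,
// 200 matmuls + readback = 1055 ms.
console.log(perOpMs(60, 1, 1055, 200)); // -> 5 ms per matmul
```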

A more ideal way to measure this would be using EXT_disjoint_timer_query, which can measure actual execution time on the GPU. But you would need to be able to somehow get the WebGL context object out of TF.js. (Not sure how hard that is.)

Let me know if I can help in any way with measuring things more accurately. Is your code on GitHub? I would be happy to open a pull request to improve the benchmark.

-Kai

Evgeny Demidov

unread,
May 7, 2019, 11:26:13 PM5/7/19
to WebGL Dev List
Hi Kai,

I've added an "async" script, similar to the TFjs time measurements, at
https://www.ibiblio.org/e-notes/webgl/gpu/mul/mul_tfjs2.htm
The first "synchronous" script is a bit faster on the AMD APU (120/130 ms for N = 1024). I'll try it on the GT 710 soon.

The difference in the code is tiny, which is why I didn't use it earlier:
  var t = C.dataSync()[0]; // or
  await C.data();

IMHO, to test the TFjs WebGPU backend on Mac you should make a simple WebGPU-based matrix multiplication "demo" (to avoid TFjs overheads). That is beyond my capabilities (sorry for the off topic :)

Evgeny