Cedric Nugteren has optimized OpenCL kernels for Nvidia GPUs (although they are 2 times slower than the cuBLAS kernels, which use GPU assembler code). One can use similar shaders 6 and 7 in webgl2-compute, WebGPU and Vulkan: https://www.ibiblio.org/e-notes/webgl/gpu/mul/sgemm.htm
Unfortunately, Cedric has an interesting observation in "clBlas on AMD GPUs": these kernels are not optimal for AMD GPUs. In "Inside clBlas" he made one more shader, 11, for the Radeon R9 280X (2550/1960, i.e. about 1.3 times faster than shader 7). On an Nvidia GPU it is 1370/830, i.e. about 1.65 times, slower.
Kai Ninomiya wrote to me: "When I ran the sgemm6b shader *in C++ with Dawn*, even ignoring that the input needed to be transposed, I saw pretty good results but they didn't quite reach our shader tuned for my hardware. I never got a chance to tune sgemm6b for my hardware" (maybe he needs to use sgemm7b, which is the fastest on AMD).
As Jiajia wrote, on Intel GPUs shaders 6 and 7 need parameter tuning with the D3D drivers (they are fast with the OpenGL backend). I do not have any numbers (just Cedric's data for OpenCL).
Surprisingly, I get high performance for shaders 6 and 7 on my small GPUs (GT 710 and AMD A6-5200 APU). I do not have any webgl2-compute test results on modern mainstream GPUs (too numerous :)
I am not sure whether a browser can detect the GPU type, for privacy reasons. All of this may be important for ML in WebGL and WebGPU. Maybe a prompt to choose the shader type?
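To make the "prompt" idea concrete, here is a minimal sketch of how the choice could look. `pickShader` and the `ShaderChoice` type are hypothetical names, and the vendor-to-shader table is only a placeholder guess, not measured data. In WebGL the renderer string could come from the `WEBGL_debug_renderer_info` extension, but a browser may hide or mask it, so the function falls back to asking the user.

```typescript
// Hypothetical helper: map a GPU renderer string to a shader variant.
// The mapping below is a placeholder assumption, not benchmark data.
type ShaderChoice = "sgemm6b" | "sgemm7b" | "ask-user";

function pickShader(renderer: string | null): ShaderChoice {
  // The browser may not expose the renderer string at all.
  if (!renderer) return "ask-user";
  const r = renderer.toLowerCase();
  // Assumed default for Nvidia GPUs.
  if (r.includes("nvidia") || r.includes("geforce")) return "sgemm6b";
  // sgemm7b is mentioned above as the fastest on AMD.
  if (r.includes("amd") || r.includes("radeon")) return "sgemm7b";
  // Anything else (e.g. Intel, where parameters still need tuning):
  // show a prompt and let the user choose.
  return "ask-user";
}
```

For example, `pickShader("AMD Radeon R9 280X")` gives `"sgemm7b"`, while `pickShader(null)` gives `"ask-user"`, so the page would show the prompt.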
Evgeny