Compute shader synchronization?


Evgeny Demidov

Mar 2, 2019, 3:02:31 AM
to WebGL Dev List
in https://www.ibiblio.org/e-notes/webgl/gpu/mul/sgemm3.htm
1. the following is used:
 memoryBarrierShared(); - wait for completion of writes to WG shared memory
 barrier(); - wait for all WG shaders to complete - isn't it excessive if all writes are already complete?

2. in
 for (uint k=0u; k < TS; k++) {
  for (uint w=0u; w < WPT; w++) {
   acc[w] += Asub[k][row] * Bsub[col + w*RTS][k];
  }
 }
 // Synchronise before loading the next tile
 barrier();
why don't we wait until all acc[w] have been stored correctly by the WG (we wait only for all WG shaders to complete)?

and is it implemented the same way in D3D11?

https://stackoverflow.com/questions/39393560/glsl-memorybarriershared-usefulness#
"full memory barrier for everything is quite expensive" - but how to find the right one?

Evgeny

jacob bogers

Mar 3, 2019, 11:08:03 PM
to WebGL Dev List
your demo at https://www.ibiblio.org/e-notes/webgl/gpu/mul/sgemm2b.htm

gave me (console.log)
>max WG invoc=1024 size=1024
>Shared mem=0

Ok, why don't I have shared mem? Is there an option I have to turn on?

Cheers

jacob bogers

Mar 3, 2019, 11:08:03 PM
to WebGL Dev List
Hi I tried your demo at https://www.ibiblio.org/e-notes/webgl/gpu/mul/sgemm2b.htm

It doesn't complain about "compute shader" not being there (using Chrome Canary, compute shaders turned on),

but now I get "no shared memory yet"

What to do?



Qin, Jiajia

Mar 4, 2019, 12:16:55 AM
to webgl-d...@googlegroups.com

barrier() only waits for all invocations in a single work group. There is no synchronization between work groups.

The first barrier makes sure a tile is loaded: all data has been uploaded to shared memory, so we need to synchronize all invocations in a single work group.

The second barrier makes sure that the above tile data has been correctly accumulated into acc. If this barrier is missing, the tile data may be modified before you save it to acc.
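A condensed sketch of the tiled loop from the demo, with the role of each barrier marked (variable names follow the snippet quoted above; numTiles, tRow, tCol and the load indexing are illustrative assumptions, not the demo's exact code):

```glsl
for (uint t = 0u; t < numTiles; t++) {
    // each invocation loads its part of the current tile into shared memory
    Asub[tRow][row] = A[...];
    Bsub[col][tCol] = B[...];
    barrier();  // 1st barrier: whole tile is in shared memory before anyone reads it

    for (uint k = 0u; k < TS; k++) {
        for (uint w = 0u; w < WPT; w++) {
            acc[w] += Asub[k][row] * Bsub[col + w*RTS][k];
        }
    }
    barrier();  // 2nd barrier: all reads of this tile finished before the next tile overwrites it
}
```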

 

On the D3D side, the translation is like below:

barrier                    -> GroupMemoryBarrierWithGroupSync
memoryBarrierShared        -> GroupMemoryBarrier
memoryBarrierAtomicCounter -> DeviceMemoryBarrier
memoryBarrierBuffer        -> DeviceMemoryBarrier
memoryBarrierImage         -> DeviceMemoryBarrier
memoryBarrier              -> AllMemoryBarrier

Details can be found here.

 

Regards,

Jiajia


Evgeny Demidov

Mar 4, 2019, 5:30:56 AM
to WebGL Dev List
I think you are using the default D3D11 backend. Use the OpenGL one
in chrome://flags/  "Choose ANGLE graphics backend" or look at

Evgeny

Evgeny Demidov

Mar 4, 2019, 5:52:07 AM
to WebGL Dev List


On Monday, March 4, 2019 at 8:16:55 AM UTC+3, Qin, Jiajia wrote:

The first barrier is used to make sure a tile is loaded. All data has been uploaded to shared memory. So we need to synchronize all invocations in a single work group.

as I understand it, "barrier()" waits only for all shaders in a WG (not for writes)

why two operators:
  memoryBarrierShared(); - shared memory writes are completed
  barrier(); - isn't it excessive if all writes are complete?

  barrier(CLK_LOCAL_MEM_FENCE); - the only operator in OpenCL

The second barrier is used to make sure that above tile data has been correctly calculated into acc. If this barrier is missed, above tile data may be modified before you save it to acc. 

  barrier(); - waits only for all shaders in a WG but not for completed writes to acc?

  barrier(CLK_LOCAL_MEM_FENCE); - in OpenCL this looks like memoryBarrierShared(), not barrier()

On the D3D side, the translation is like below:

barrier                    -> GroupMemoryBarrierWithGroupSync
memoryBarrierShared        -> GroupMemoryBarrier
memoryBarrierAtomicCounter -> DeviceMemoryBarrier
memoryBarrierBuffer        -> DeviceMemoryBarrier
memoryBarrierImage         -> DeviceMemoryBarrier
memoryBarrier              -> AllMemoryBarrier

Details can be found here.

Thank you, I'll read it.

Evgeny

jiaji...@intel.com

Mar 5, 2019, 12:47:25 AM
to WebGL Dev List
as I understand "barrier()" waits only for all shaders in a WG (not writings)
In the latest GLSL and ESSL specs, it has already been clarified that barrier() affects control flow and also synchronizes memory accesses, but only for shared variables. This means barrier() by itself is enough to synchronize both control flow and memory accesses to shared variables, so we can use barrier() to replace 'memoryBarrierShared() + barrier()'.

barrier(CLK_LOCAL_MEM_FENCE); - the only operator in OpenCL
barrier(CLK_LOCAL_MEM_FENCE); - in OpenCL looks like memoryBarrierShared() not barrier() 
Based on the definition of barrier in OpenCL: 'All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier.' I think it's an invocation control function. CLK_LOCAL_MEM_FENCE is more like memoryBarrierShared(), so barrier(CLK_LOCAL_MEM_FENCE) is equivalent to 'memoryBarrierShared() + barrier()'. Since barrier() already includes memoryBarrierShared(), you can directly use barrier().

barrier();  - waits only for all shaders in a WG but not for accurate writings to acc?
Yes, barrier() is not about the writes to acc themselves, but it makes sure that the writing to acc happens before the next tile is loaded. So I think the second barrier is a MUST to ensure the right execution sequence. The first barrier may not be necessary; memoryBarrierShared() would be enough in the first place.

Kentaro Kawakatsu

Mar 8, 2019, 1:35:23 AM
to WebGL Dev List
Hi Jiajia.
I was wondering why Evgeny's matrix multiplication demo (https://www.ibiblio.org/e-notes/webgl/gpu/mul/sgemm2.htm) produces significant errors on the D3D backend though it works well on the GL backend.
(To see this demo on the D3D backend, you need to comment out the shared memory check code, since getting MAX_COMPUTE_SHARED_MEMORY_SIZE on D3D has not been implemented yet, I think.)

So I tried a simple shared memory test below:

This shader sums data read from shared memory and writes the result back to an SSBO, that's all. The result should be all 130816. If one of the result elements does not equal the expected value, it is logged to the console.
I tried this on D3D and reloaded many times; sometimes it succeeded but sometimes it failed. My environments are a GTX 780M and a GTX 1080Ti; both exhibit this problem.
But this page always succeeds on the GL backend, so I think this is a shared memory sync issue which occurs only on D3D.
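The test shader itself did not survive in this archived copy. A minimal sketch consistent with the description (a reconstruction, not Kentaro's original code: 512 invocations per work group, each writing its own index into shared memory, so every sum should be 0+1+...+511 = 130816) could look like:

```glsl
#version 310 es
layout (local_size_x = 512) in;

layout (std430, binding = 0) buffer Result {
    uint result[];
};

shared uint sdata[512];

void main() {
    uint id = gl_LocalInvocationID.x;
    sdata[id] = id;              // each invocation writes its own slot
    barrier();                   // all slots must be written before summing

    uint sum = 0u;
    for (uint i = 0u; i < 512u; i++) {
        sum += sdata[i];
    }
    result[gl_GlobalInvocationID.x] = sum;  // expected: 130816 for every element
}
```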

Next, I checked the HLSL code converted by ANGLE using the WEBGL_debug_shaders extension (I log it to the console in the above demo).
The converted HLSL code looks almost problem-free, but I am concerned about one point: shared memory is zero-initialized at its declaration.
In OpenGL, it is not permitted to initialize shared memory at declaration. And that's not permitted in CUDA either.

I also searched the HLSL documentation, but I couldn't find a description of groupshared initialization - whether it is permitted or not.
As far as my search went, I couldn't find any HLSL example in which groupshared is initialized at declaration.
When is this initialization executed? By whom? If each thread executes this initialization, is there a possibility that a later-executed thread zero-overwrites groupshared memory which has already been written with a valid value by a preceding thread?

For more investigation, I tried a Unity ComputeShader (Windows D3D11) using the HLSL code as-is, as translated by ANGLE.

I got the same result: sometimes it succeeded, sometimes it failed. Then I tried commenting out the initialization code; after that it always succeeded.

Would you please take a look at whether this auto-inserted zero initialization is having a bad effect?

-Kentaro

jiaji...@intel.com

Mar 10, 2019, 11:21:15 PM
to WebGL Dev List
Hi Kentaro

Thanks for your great analysis. We have reported a bug here. And a temporary fix is here.

In OpenGL, it is not permitted to initialize shared memory when declaring. And that's not permitted also in CUDA. 
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#assignment-operator
Can you quote the sentence saying it is not permitted to initialize shared memory in OpenGL? I only found that 'Variables declared as shared may not have initializers and their contents are undefined at the beginning of shader execution.'

In WebGL, undefined behavior is not allowed, due to security. We should clearly point out what will happen where the native behavior is undefined. Here are some discussions about webgl2-compute security and shared memory initialization in ANGLE.

Would you please take a look whether this auto inserted zero initialization is taking bad effect?
Yes, I have confirmed it. But I haven't got a good solution to fix it yet. Currently, I just disable the initialization of shared memory. Feel free to comment under the bug.

Thanks,
Jiajia

Kentaro Kawakatsu

Mar 17, 2019, 3:35:44 PM
to WebGL Dev List
Hi Jiajia.
Sorry for my late response.

Sorry, it seems this issue had already been detected and fixed before my investigation... I'll search first next time.
Thanks! I tried the latest Canary and confirmed that shared memory works well on the D3D backend (and so does MAX_COMPUTE_SHARED_MEMORY_SIZE).

Can you quote the sentence where it is not permitted to initialize shared memory in OpenGL? I only find that 'Variables declared as shared may not have initializers and their contents are undefined at the beginning of shader execution.'
Yes, I inferred that from the sentence you indicated.
This may be my over-interpretation.
But currently I get a WebGL error like "ERROR: 0:6: 'shared' : cannot initialize this type of qualifier" when compiling compute shader GLSL code that initializes shared memory.
And here you discussed it.

In WebGL, undefined behavior is not allowed due to the security. We should clearly point it out what will happen if the native is undefined. Here is some discussions about webgl2-compute security and shared memory initialization in ANGLE. 

Yes, I have confirmed it. But I haven't got a good solution to fix it. Currently, I just simply disable the initialization of shared memory. Feel free to comment under the bug.

Thank you. I read the discussion and understood that this initialization had been added explicitly for web security.

I have no good idea how to solve this...
In my investigation, I also found that this works well on D3D before your fix (simply disabling the initialization of shared memory). So at least [force adding initialization when declaring] + [force adding synchronization at the top of the main func] may work well. But synchronization is not good for performance, as we know...

-Kentaro

Evgeny Demidov

Mar 22, 2019, 1:26:33 PM
to WebGL Dev List
in https://www.ibiblio.org/e-notes/webgl/gpu/mul/sgemm2.htm
        // Synchronise to make sure the tile is loaded
        memoryBarrierShared();
        barrier();
Two operators are used to synchronize shared memory. But from the discussion
"Compute: Control barrier and shared memory within the local work group #10"
at https://github.com/KhronosGroup/GLSL/issues/10 and latest ES 3.2 spec
https://www.khronos.org/registry/OpenGL/specs/es/3.2/GLSL_ES_Specification_3.20.html#shader-invocation-control-functions
it follows that just one barrier() call is enough.
1. shall I remove memoryBarrierShared(); call in WebGL2-compute (ES 3.1 based) ?
2. how does it work in D3D11 ?

Evgeny


jiaji...@intel.com

Mar 25, 2019, 4:39:52 AM
to WebGL Dev List
1. shall I remove memoryBarrierShared(); call in WebGL2-compute (ES 3.1 based) ?
If the underlying driver already supports ESSL 3.2, I think we can remove memoryBarrierShared(). However, if the underlying driver is still on ESSL 3.1, it seems that we still need memoryBarrierShared() + barrier().
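A sketch of the two patterns (the ESSL version split follows the answer above; which one a given driver actually needs is something to verify against the driver's reported version):

```glsl
// Conservative pattern, safe on ESSL 3.1 drivers:
memoryBarrierShared();  // make shared-memory writes visible to the work group
barrier();              // wait for all invocations in the work group

// Per the ESSL 3.2 clarification, barrier() alone also orders
// shared-memory accesses, so this is sufficient there:
barrier();
```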
2. how does it work in D3D11 ?
In D3D11, the translation is like below:
memoryBarrierShared() becomes GroupMemoryBarrier()
barrier() becomes GroupMemoryBarrierWithGroupSync()
So in D3D11, GroupMemoryBarrierWithGroupSync() already includes GroupMemoryBarrier().

Regards,
Jiajia 

jiaji...@intel.com

Apr 19, 2019, 3:00:08 AM
to WebGL Dev List
In my investigation, I also found this works well on D3D before your fix(simply disable the initialization of shared memory). So at least [force adding initialization when declaring] + [force adding synchronization at the top of main func] may work well. But synchronization is not good for performance as we know...
The feedback from Microsoft: "Initialization is just a different syntax telling the compiler to generate a bunch of writes to that memory, so yes, a barrier will be needed to ensure all threads see the data that was written."
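Following that feedback, one way to make the zero initialization safe is to turn it into explicit per-invocation writes followed by a barrier at the top of main. A sketch in GLSL (the array size and name are illustrative, not ANGLE's actual generated code):

```glsl
shared uint sdata[512];

void main() {
    sdata[gl_LocalInvocationID.x] = 0u;  // each invocation clears its own slot
    barrier();                           // every invocation now sees fully cleared memory
    // ... actual work using sdata ...
}
```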