A few questions about shader storage buffer usage


Mark Sibly

Apr 15, 2023, 1:53:17 AM
to Dawn Graphics
Hi,

I'm thinking of having a go at a 'linked list per pixel' system for order independent transparency.

Am I correct in thinking I can use 'shader storage' buffers to represent both a framebuffer-sized array of 'head' pointers and an array of free/used nodes? This would mean the shader won't actually generate any output; it'll just update the storage buffers, and I'll use them in a separate pass to blend the colors and write them to the real framebuffer.
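Roughly what I have in mind, as a (totally untested, names invented) WGSL sketch:

    // One list 'head' index per framebuffer pixel, plus a pool of nodes,
    // both in storage buffers.
    struct Node {
        color : vec4f,
        depth : f32,
        next : u32,   // index of the next node in this pixel's list
    }

    @group(0) @binding(0) var<storage, read_write> heads : array<u32>;
    @group(0) @binding(1) var<storage, read_write> nodes : array<Node>;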

Will I need to use atomicLoad/atomicStore/atomicExchange etc. when, for example, allocating nodes? And will those storage buffer members need to be declared atomic<u32> etc.?

Finally, something I've always wondered: do shaders always take the same amount of time to execute, or does the GPU let them run as fast as possible and allocate them from a pool? For example, will an 'if' block (one that isn't dependent on uniforms, I guess) take less time to execute if the condition evaluates to false? Is it possible for shaders to 'halt'?!?

Bye,
Mark

Mark Sibly

Apr 16, 2023, 11:52:51 PM
to Dawn Graphics
OK, I had a good go at this today and got lots of things figured out - please ignore my above questions unless it looks like I've got something wildly wrong.

I am, however, stuck at locking a critical section of code inside a shader. (The lock below is a 'per pixel' lock, and my app is drawing 100 square overlapping sprites. To start with, I've been trying to implement a simple depth buffer with a storage buffer.)

I started out with a classic lazy spinlock, eg:

while !atomicCompareExchangeWeak(&lock, 0u, 1u).exchanged { }
...critical stuff here...
atomicStore(&lock, 0u);

Found a few variations on this, my favourite being I think:

while atomicExchange(&lock, 1u) == 1u { }

However, this approach would just crash the graphics card, usually quite politely with a 'lost device' error - so I added a retry counter to stop it spinning forever, eg:

var retries = 10;
while !atomicCompareExchangeWeak(&lock, 0u, 1u).exchanged {
    if retries == 0 { return FAILURE; }
    retries -= 1;
}

This worked better, although no matter how high I set retries I couldn't get it to work perfectly. Even with a million retries, there were still noticeable errors, and it was running VERY slowly!

So I tried something 'ticket' based, eg:

let myTicket = atomicAdd(&ticket, 1u);
while atomicLoad(&turn) != myTicket { }
...critical stuff here...
atomicAdd(&turn, 1u);

I thought I had it that time, but this just locks up too!

It seems like the 'unlock' in all cases is somehow occasionally 'getting lost', ie: other shader invocations can't see the unlocked value, perhaps similar to non-atomic C++ vars? Either that, or the 'busy loops' when locking are somehow crashing?

Any help, as always, would be greatly appreciated!

Bye,
Mark


--
You received this message because you are subscribed to the Google Groups "Dawn Graphics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dawn-graphic...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dawn-graphics/84c032c0-dba4-4c27-813c-88b82b5ec765n%40googlegroups.com.

Mark Sibly

Apr 17, 2023, 12:13:16 AM
to Dawn Graphics
Hmmm, after a bit of googling I'm starting to get the distinct impression the answer is "you can't do that", ie: my unlock code is fine, you just can't busy-wait in a shader... which makes life a lot more interesting!

Kai Ninomiya

Apr 18, 2023, 11:16:24 AM
to Mark Sibly, Dawn Graphics
Shaders use what's called a "SIMT" (Single-Instruction, Multiple-Thread) execution model to execute over SIMD hardware. All threads within one subgroup (usually 16/32/64 threads) must always stay at the same instruction pointer as there is only one for the whole subgroup. So if one thread is in a spin-loop, all threads in that subgroup are also in that spin-loop, even if they're not supposed to execute it - the hardware disables all visible effects from executing that code.

So while 15 of the threads are stuck in the wait loop, the 16th thread is stuck there too, and can't move on to the atomicStore.
In order for it to work, *all* threads must pass the atomicStore operation repeatedly (conditionally disabled in all but one thread), which means it needs to be inside the loop, which means you need a different condition to know when to end the loop.
Unfortunately I don't remember offhand how to do this in code, but hopefully this helps. This looks about right IIRC: https://stackoverflow.com/a/58555502

-Kai (he/they)


Mark Sibly

Apr 18, 2023, 10:30:01 PM
to Kai Ninomiya, Dawn Graphics
Holy crap it works!

Here's my crappy little depth buffer test function:

    var locking = true;
    while locking {
        if atomicCompareExchangeWeak(&(*frag).lock, 0, 1).exchanged {
            if depth < (*frag).depth {
                (*frag).depth = depth;
                (*frag).color = color;
            }
            atomicStore(&(*frag).lock, 0);
            locking = false;
        }
    }

It's pretty sensitive to the ops used: it doesn't work with atomicExchange or the ticket system to do the locking, or with 'break' instead of the 'locking' var to exit the loop. But I'm super impressed it works at all - this should be very useful!

It also doesn't appear to be 'thrashing' at all - in fact it seems to always take exactly 2 iterations for the lock to happen, but I'll need to look into that more closely.

I did manage to get a linked list and proper OIT working without a lock, but memory usage would be too brutal, especially with HDR colors, so time to learn some compute programming to sort my sprites I guess!

Also, the thesis linked in the post you mentioned says something about gl_HelperInvocation, which apparently may be necessary on some cards, but there's nothing like this in WGSL - should I be worried?

Bye,
Mark

Loko Kung

Apr 19, 2023, 1:40:53 AM
to Mark Sibly, Kai Ninomiya, Dawn Graphics
Hey, if you are looking to sort stuff, I recently worked on a WebGPU sorting library for one of our exploration sprints here: https://github.com/lokokung/webgpu-sort. It's still in its infancy and I eventually hope to make it a more general WebGPU algorithms package but feel free to take inspiration (or use the library) for sorting!

David Neto

Apr 19, 2023, 4:39:25 PM
to Dawn Graphics

As Kai pointed out, whether a spin lock like this works or not depends on the particular details of the GPU and how the shader is compiled.

In general this is about the fine-grained execution model, and whether there are forward-progress guarantees.  Not much is going to be very portable.
Sometimes there can be forward progress between subgroups, but you can't count on it.

I don't know the bigger shape of your workload. But what should work is to set up an abstract work list, then do work stealing.
Basically, all the invocations take an item to work on, try to do that piece of work, and if that succeeded, mark it as done.  Repeat until all the work is drained.
The key is to make it starvation-free in the face of an adversarial schedule.  (I hope I got the terms right).
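As a very rough compute shader sketch (untested, names invented):

    @group(0) @binding(0) var<storage, read_write> next : atomic<u32>;
    @group(0) @binding(1) var<storage, read_write> done : array<atomic<u32>>;

    // Every invocation runs this loop: claim an item, attempt it, repeat.
    loop {
        let i = atomicAdd(&next, 1u);    // claim the next unclaimed item
        if i >= numItems { break; }      // all work has been handed out
        // ... try to do item i's piece of work ...
        atomicStore(&done[i], 1u);       // on success, mark it done
        // on failure, re-enqueue i so some invocation picks it up again
    }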


david

Mark Sibly

Apr 19, 2023, 4:56:31 PM
to David Neto, Dawn Graphics
Ahhh well, that's a shame - although, playing around with it further, it did all feel very fragile, so I'm not too surprised. Might be a cool future GPU feature 'coz it definitely feels possible.

The worklist idea doesn't really work with fragment shaders though, does it? In fact, all I can really do there is create worklists, ie: the per-pixel linked lists.

Bye,
Mark


David Neto

Apr 19, 2023, 5:19:19 PM
to Mark Sibly, Dawn Graphics
Right.  I wasn't sure if it was a fragment shader or not.  gl_HelperInvocation is only in fragment shaders, and you're right, WGSL doesn't have that feature.
IIRC it's not 100% portable as to when helper invocations come into existence and disappear. It's muddled when you poke at it.

david


