Struggling to build a performant sRGB blend function

Ben Harper

Dec 20, 2019, 9:24:14 AM

to Intel SPMD Program Compiler Users

I'm trying to write a function using ISPC, which takes as input:

A scanline of a glyph, as rendered by Freetype. This is a uint8 alpha mask.
A target RGBA (uint8 x 4) surface.
A 3-element color (RGB) to draw onto the surface, masked by the glyph's alpha channel

and does the following:

Composite the glyph scanline onto the target surface using the OVER operator with gamma-correct blending (i.e. transform source and destination from sRGB to linear float, perform the blending in linear space, then transform back to sRGB and write out to the target surface).

The goal here is straightforward: it's just the final step needed to consume Freetype output and show it on the screen.

Here is a gist of my ISPC function: https://gist.github.com/bmharper/c5d194dd04b79f8db55de60edff53ae0

I'm compiling with: ispc --target=avx2-i32x4 --opt=fast-math

It feels like I'm doing this wrong. I get a bunch of gather/scatter warnings, the generated inner loop looks longer than it should be, and I suspect I could do quite a bit better if I hand-crafted it.

These are the ispc compiler warnings:

blend.ispc:21:17: Performance Warning: Conversion from unsigned int to float is slow. Use "int" if possible
    float alpha = glyph[alphaChan] / 255.0f;
                  ^^^^^^^^^^^^^^^^

blend.ispc:22:41: Performance Warning: Gather required to load value.
    dst[i] = float_to_srgb8((1 - alpha) * sRGBToLinear[dst[i]] + alpha * color[i & 3]);
                                          ^^^^^^^^^^^^^^^^^^^^

blend.ispc:22:72: Performance Warning: Gather required to load value.
    dst[i] = float_to_srgb8((1 - alpha) * sRGBToLinear[dst[i]] + alpha * color[i & 3]);
                                                                         ^^^^^^^^^^^^

Can anybody suggest a better paradigm?

Thanks,

Ben

Benjamin Legros

Dec 20, 2019, 9:41:58 AM

to ispc-...@googlegroups.com

The lookup table will cause ispc to emit gather instructions, because the index of the lookup is varying. There's not really anything you can do about it, and hardware gather is only available on AVX2 and above; SSEx and AVX have no support for it. Also, having the alpha channel interleaved with RGB makes the lanes depend on each other, which is not particularly SIMD-friendly.

If you're not afraid of a little approximation, you may consider replacing sRGB_to_linear with a square and linear_to_sRGB with a square root (that is, a gamma of 2.0, which approximates a 2.2 power function, which itself approximates the actual sRGB function).


Benjamin Legros

Dec 20, 2019, 9:48:30 AM

to ispc-...@googlegroups.com

The lookup table will cause ispc to emit gather instructions, because the index of the lookup is varying AND not contiguous.

My bad.
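
To make the distinction concrete, here is a scalar C sketch of the two access patterns for a 4-wide gang (the names are illustrative, not ISPC internals). When the per-lane indices are base, base+1, base+2, base+3, the compiler can issue one vector load; when each lane's index comes from data, as with a table indexed by dst[i], every lane reads from an unrelated address, which is what a gather is.

```c
#include <stdint.h>

#define LANES 4

/* Contiguous access: lane k reads src[base + k].
   This pattern can map to a single vector load. */
void load_contiguous(const float *src, int base, float out[LANES]) {
    for (int k = 0; k < LANES; k++)
        out[k] = src[base + k];
}

/* Gather: lane k reads table[idx[k]], where idx[k] is data-dependent
   (for example, a destination byte used as a lookup-table index). */
void load_gather(const float *table, const uint8_t idx[LANES], float out[LANES]) {
    for (int k = 0; k < LANES; k++)
        out[k] = table[idx[k]];
}
```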

Ben Harper

Dec 20, 2019, 2:05:07 PM

to Intel SPMD Program Compiler Users

Thanks for the advice! I'll give those things a try and report back here how it goes.

Ben Harper

Dec 24, 2019, 2:39:43 AM

to Intel SPMD Program Compiler Users

So, I ended up pre-expanding the alpha channel 4x, and then using this simpler blend function:

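// sRGB -> linear, approximated as x^2 (gamma 2.0 instead of the exact sRGB curve).
inline float FromSRGBCheap(uint8 v) {
    float f = (float) v * (1.0f / 255.0f);
    return f * f;
}

// linear -> sRGB, approximated as sqrt(x). The cast truncates rather than rounds.
inline uint8 ToSRGBCheap(float v) {
    return (uint8) (sqrt(v) * 255.0f);
}

// glyph[] holds the alpha mask pre-expanded 4x (one byte per destination byte).
// color[programIndex] selects the matching channel only for a 4-wide gang,
// so color[] needs 4 entries.
export void BlendSRGB(uniform int nPix, uniform float color[], uniform uint8 glyph[], uniform uint8 dst[]) {
    foreach (i = 0 ... nPix * 4) {
        float alpha = glyph[i] / 255.0f;
        dst[i] = ToSRGBCheap((1 - alpha) * FromSRGBCheap(dst[i]) + alpha * color[programIndex]);
    }
}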

and the resulting code looks reasonably good:

.LBB0_2:                                # %foreach_full_body
                                        # =>This Inner Loop Header: Depth=1
    movd    (%rdx,%rsi), %xmm6      # xmm6 = mem[0],zero,zero,zero
    punpcklbw   %xmm1, %xmm6    # xmm6 = xmm6[0],xmm1[0],xmm6[1],xmm1[1],xmm6[2],xmm1[2],xmm6[3],xmm1[3],xmm6[4],xmm1[4],xmm6[5],xmm1[5],xmm6[6],xmm1[6],xmm6[7],xmm1[7]
    punpcklwd   %xmm1, %xmm6    # xmm6 = xmm6[0],xmm1[0],xmm6[1],xmm1[1],xmm6[2],xmm1[2],xmm6[3],xmm1[3]
    cvtdq2ps    %xmm6, %xmm6
    mulps   %xmm2, %xmm6
    movaps  %xmm8, %xmm7
    subps   %xmm6, %xmm7
    movd    (%rcx,%rsi), %xmm3      # xmm3 = mem[0],zero,zero,zero
    punpcklbw   %xmm1, %xmm3    # xmm3 = xmm3[0],xmm1[0],xmm3[1],xmm1[1],xmm3[2],xmm1[2],xmm3[3],xmm1[3],xmm3[4],xmm1[4],xmm3[5],xmm1[5],xmm3[6],xmm1[6],xmm3[7],xmm1[7]
    punpcklwd   %xmm1, %xmm3    # xmm3 = xmm3[0],xmm1[0],xmm3[1],xmm1[1],xmm3[2],xmm1[2],xmm3[3],xmm1[3]
    cvtdq2ps    %xmm3, %xmm3
    mulps   %xmm2, %xmm3
    mulps   %xmm3, %xmm3
    mulps   %xmm7, %xmm3
    mulps   %xmm0, %xmm6
    addps   %xmm3, %xmm6
    sqrtps  %xmm6, %xmm3
    mulps   %xmm4, %xmm3
    cvttps2dq   %xmm3, %xmm3
    pand    %xmm5, %xmm3
    packuswb    %xmm3, %xmm3
    packuswb    %xmm3, %xmm3
    movd    %xmm3, (%rcx,%rsi)
    addq    $4, %rsi
    cmpq    %rax, %rsi
    jl  .LBB0_2
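
The pre-expansion step itself is not shown in the post. One way it could look on the C side, before calling BlendSRGB, is sketched below; the function name and the caller-provided output buffer are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Replicate each 8-bit glyph alpha 4 times so that expanded[i] lines up
   with byte i of the RGBA destination scanline.
   The caller must provide room for n_pix * 4 bytes in expanded. */
void expand_alpha_4x(const uint8_t *glyph, size_t n_pix, uint8_t *expanded) {
    for (size_t p = 0; p < n_pix; p++) {
        uint8_t a = glyph[p];
        expanded[p * 4 + 0] = a;
        expanded[p * 4 + 1] = a;
        expanded[p * 4 + 2] = a;
        expanded[p * 4 + 3] = a;
    }
}
```

With the mask laid out this way, every load in the blend loop is contiguous and each lane works on its own byte, at the cost of an extra pass and four times the mask memory.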

Thanks for the help!

Ben
