Struggling to build a performant sRGB blend function


Ben Harper

Dec 20, 2019, 9:24:14 AM
to Intel SPMD Program Compiler Users
I'm trying to write a function using ISPC, which takes as input:
  • A scanline of a glyph, as rendered by Freetype. This is a uint8 alpha mask.
  • A target RGBA (uint8 x 4) surface.
  • A 3-element color (RGB) to draw onto the surface, masked by the glyph's alpha channel.
and does the following:
  • Composite the glyph scanline onto the target surface using the OVER operator, with gamma-correct blending (i.e. transform source and destination from sRGB to linear float, perform the blending in linear space, then transform back to sRGB and write out to the target surface).
The goal here is straightforward: it's just the final step needed to consume Freetype output and show it on the screen.


I'm compiling with: ispc --target=avx2-i32x4 --opt=fast-math

It feels like I'm doing this wrong. I get a bunch of gather/scatter warnings, the generated inner loop seems too long, and I suspect I could do quite a bit better if I hand-crafted it.

These are the ispc compiler warnings:

blend.ispc:21:17: Performance Warning: Conversion from unsigned int to float is slow. Use "int" if possible
float alpha = glyph[alphaChan] / 255.0f;
                ^^^^^^^^^^^^^^^^

blend.ispc:22:41: Performance Warning: Gather required to load value.
dst[i] = float_to_srgb8((1 - alpha) * sRGBToLinear[dst[i]] + alpha * color[i & 3]);
                                        ^^^^^^^^^^^^^^^^^^^^

blend.ispc:22:72: Performance Warning: Gather required to load value.
dst[i] = float_to_srgb8((1 - alpha) * sRGBToLinear[dst[i]] + alpha * color[i & 3]);
                                                                       ^^^^^^^^^^^^


Can anybody suggest a better paradigm?

Thanks,
Ben

Benjamin Legros

Dec 20, 2019, 9:41:58 AM
to ispc-...@googlegroups.com

The lookup table will cause ispc to emit gather instructions, because the index of the lookup is varying. There's not really anything you can do about it, and hardware gather is only available on AVX2 and above; SSEx and AVX have no hardware support for it. Also, having the alpha channel interleaved with RGB makes the lanes dependent on each other, which is not particularly SIMD-friendly.

If you're not afraid of a little approximation, you may consider replacing sRGB_to_linear with a square and linear_to_sRGB with a square root (that is, a 2.0 power function standing in for a 2.2 power function, which itself approximates the actual sRGB function).



Benjamin Legros

Dec 20, 2019, 9:48:30 AM
to ispc-...@googlegroups.com

The lookup table will cause ispc to emit gather instructions, because the index of the lookup is varying AND not contiguous.

My bad.
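
To make "varying and not contiguous" concrete: an access like dst[i] inside a foreach uses indices base, base+1, base+2, ..., which a single vector load can serve, while sRGBToLinear[dst[i]] uses whatever byte values happen to be in dst. The C sketch below only illustrates that property on a set of per-lane indices; ispc makes the decision at compile time from the form of the index expression, not with a runtime check like this, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* True when the per-lane indices form base, base+1, ..., base+width-1,
 * i.e. the access could be served by one contiguous vector load.
 * Any other pattern needs one load per lane (a gather). */
bool lanes_are_contiguous(const int idx[], int width) {
    assert(idx && width > 0);
    for (int lane = 1; lane < width; ++lane)
        if (idx[lane] != idx[0] + lane) return false;
    return true;
}
```

For example, the indices {8, 9, 10, 11} pass the check, while table indices such as {200, 17, 17, 93} do not.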

Ben Harper

Dec 20, 2019, 2:05:07 PM
to Intel SPMD Program Compiler Users
Thanks for the advice! I'll give those things a try and report back here how it goes.

Ben Harper

Dec 24, 2019, 2:39:43 AM
to Intel SPMD Program Compiler Users
So, I ended up pre-expanding the alpha channel 4x, and then this simpler blend function:

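// Cheap sRGB -> linear: approximate the transfer curve with a square.
inline float FromSRGBCheap(uint8 v) {
    float f = (float) v * (1.0f / 255.0f);
    return f * f;
}

// Cheap linear -> sRGB: approximate the inverse curve with a square root.
inline uint8 ToSRGBCheap(float v) {
    return (uint8) (sqrt(v) * 255.0f);
}

// glyph[] is the alpha mask pre-expanded 4x (one alpha byte per dst byte).
// color[programIndex] relies on a 4-wide gang, so that lane k is channel k.
export void BlendSRGB(uniform int nPix, uniform float color[], uniform uint8 glyph[], uniform uint8 dst[]) {
    foreach (i = 0 ... nPix * 4) {
        float alpha = glyph[i] / 255.0f;
        dst[i] = ToSRGBCheap((1 - alpha) * FromSRGBCheap(dst[i]) + alpha * color[programIndex]);
    }
}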

and the resulting code looks reasonably good:

.LBB0_2: # %foreach_full_body
# =>This Inner Loop Header: Depth=1
movd (%rdx,%rsi), %xmm6 # xmm6 = mem[0],zero,zero,zero
punpcklbw %xmm1, %xmm6 # xmm6 = xmm6[0],xmm1[0],xmm6[1],xmm1[1],xmm6[2],xmm1[2],xmm6[3],xmm1[3],xmm6[4],xmm1[4],xmm6[5],xmm1[5],xmm6[6],xmm1[6],xmm6[7],xmm1[7]
punpcklwd %xmm1, %xmm6 # xmm6 = xmm6[0],xmm1[0],xmm6[1],xmm1[1],xmm6[2],xmm1[2],xmm6[3],xmm1[3]
cvtdq2ps %xmm6, %xmm6
mulps %xmm2, %xmm6
movaps %xmm8, %xmm7
subps %xmm6, %xmm7
movd (%rcx,%rsi), %xmm3 # xmm3 = mem[0],zero,zero,zero
punpcklbw %xmm1, %xmm3 # xmm3 = xmm3[0],xmm1[0],xmm3[1],xmm1[1],xmm3[2],xmm1[2],xmm3[3],xmm1[3],xmm3[4],xmm1[4],xmm3[5],xmm1[5],xmm3[6],xmm1[6],xmm3[7],xmm1[7]
punpcklwd %xmm1, %xmm3 # xmm3 = xmm3[0],xmm1[0],xmm3[1],xmm1[1],xmm3[2],xmm1[2],xmm3[3],xmm1[3]
cvtdq2ps %xmm3, %xmm3
mulps %xmm2, %xmm3
mulps %xmm3, %xmm3
mulps %xmm7, %xmm3
mulps %xmm0, %xmm6
addps %xmm3, %xmm6
sqrtps %xmm6, %xmm3
mulps %xmm4, %xmm3
cvttps2dq %xmm3, %xmm3
pand %xmm5, %xmm3
packuswb %xmm3, %xmm3
packuswb %xmm3, %xmm3
movd %xmm3, (%rcx,%rsi)
addq $4, %rsi
cmpq %rax, %rsi
jl .LBB0_2

Thanks for the help!

Ben
