AVX in Pixel Routine


podeti vishal

Dec 18, 2020, 12:54:31 AM
to swiftshader
Hi All,

I'm exploring ways to implement AVX support in PixelRoutine. As I understand it, each loop iteration in PixelRoutine processes 4 pixels, and the input data is laid out as an array of Float4.

Take the mul primitive in Shader as an example. It takes a Vector4f src as input, where src.x, src.y, src.z, and src.w can each be thought of as one xmm register. We could create a new Float8 data type, use createShuffleVector to form src_xy from src.x and src.y, call mul on ymm registers, and then use createShuffleVector again to split the output back into x, y, z, w.

input -> ShuffleVector -> optimized primitive -> ShuffleVector -> output

But this approach did not yield any benefit, because createShuffleVector consumes too many CPU cycles.

Another approach is to optimize the whole pipeline so that the input is a Vector8f src holding src.xy and src.zw:
interpolate -> applyShader -> writeColor

But with this approach too, functions such as reflect, refract, dot, normalize, determinant, and forward, which operate on two or three components, would again need createShuffleVector.

Also, in the applyShader function, when updating the r and v registers from d, we have:
    if(dst.x) r[dst.index].x = d.x;
    if(dst.y) r[dst.index].y = d.y;
    if(dst.z) r[dst.index].z = d.z;
    if(dst.w) r[dst.index].w = d.w;

Here too we would need unpack/ShuffleVector operations, which would hurt performance again.

Can you please suggest a better approach for doing math operations in the ShaderCore primitives with AVX? I'm new to SwiftShader and am just sharing my thoughts here. I'd appreciate any input on how this could be done efficiently.
Thanks in advance.

Nicolas Capens

Dec 18, 2020, 10:42:07 PM
to podeti vishal, swiftshader
Hi Podeti,

Thanks for your interest in SwiftShader! What use case do you have in mind for it?

Support for "wide" vectors is something I've wanted to add for many years now, but never got around to. We may finally get started on it next year.

I noticed you're referring to the src/Shader/PixelRoutine.cpp source file. This is actually part of our legacy OpenGL ES implementation's code stack. It will be deleted soon, as our Vulkan implementation is superior and backward compatibility for GLES is provided through the ANGLE project.

The new (SPIR-V) shader implementation is in the src/Pipeline directory. The implementation of arithmetic shader instructions is very similar to those we used for OpenGL though. The key to high performance with wide vectors is to keep things in structure-of-arrays form. With 256-bit vectors we'd process 8 pixels in parallel, and each scalar in the shader source corresponds to one 32-bit lane in the SIMD vectors. So a 4-component vector in the shader becomes a structure like Vector4f with 4 vectors of 8 elements each. No shuffles are required for performing shader arithmetic in this form. A scalar multiplication in the source shader becomes one vector multiplication of 8 element-wise multiplications. This avoids any need for shuffle instructions (the exception being when reading/writing resources in regular array-of-structures form).
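
To make the structure-of-arrays idea concrete, here is a minimal sketch in plain C++ rather than Reactor. The names Wide, Vector4fWide, dot4 and writeMasked are hypothetical and are not SwiftShader types; Wide merely stands in for one 256-bit register. It shows that a dot product and a write-masked register update only need lane-wise operations:

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for one 256-bit register: 8 float lanes, one per pixel.
constexpr std::size_t kLanes = 8;
using Wide = std::array<float, kLanes>;

// Lane-wise ("vertical") operations. Lane i of the result depends only on
// lane i of the inputs, so no value ever moves to another lane.
inline Wide mul(const Wide &a, const Wide &b)
{
    Wide r{};
    for(std::size_t i = 0; i < kLanes; i++) r[i] = a[i] * b[i];
    return r;
}

inline Wide add(const Wide &a, const Wide &b)
{
    Wide r{};
    for(std::size_t i = 0; i < kLanes; i++) r[i] = a[i] + b[i];
    return r;
}

// Structure-of-arrays vec4: x holds the x component of 8 different pixels,
// y holds their y components, and so on.
struct Vector4fWide
{
    Wide x, y, z, w;
};

// dot(a, b) for 8 pixels at once: four multiplies and three adds, no shuffles.
inline Wide dot4(const Vector4fWide &a, const Vector4fWide &b)
{
    return add(add(mul(a.x, b.x), mul(a.y, b.y)),
               add(mul(a.z, b.z), mul(a.w, b.w)));
}

// Write-masked register update. The mask selects whole components, and each
// component is its own vector, so every enabled write is a plain vector copy.
inline void writeMasked(Vector4fWide &r, const Vector4fWide &d,
                        bool x, bool y, bool z, bool w)
{
    if(x) r.x = d.x;
    if(y) r.y = d.y;
    if(z) r.z = d.z;
    if(w) r.w = d.w;
}
```

In this form, operations that mix components of one vector (dot, normalize, reflect) combine separate registers lane by lane, which is why they stop needing shuffles.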

To support CPUs with either 128-bit, 256-bit, or even wider SIMD vectors without writing a lot of duplicate code, I'm planning to abstract away the actual width where possible. Thus no Float8 type would be introduced. Instead, a SIMD::Float type has a width depending on what's supported by the CPU. We've already introduced this type in our Vulkan code, but at the moment it's still fixed at being 4 lanes wide.
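
One way to picture a width-agnostic type is the sketch below. It is plain C++, not the actual SIMD::Float from the Vulkan code, and it assumes the lane count is chosen at compile time from the __AVX__ macro; simd::Width, simd::Float and mad are illustrative names only:

```cpp
#include <array>
#include <cstddef>

namespace simd {

// Assumption of this sketch: the lane count is picked at compile time from
// the target's feature macros. A real implementation could choose it another way.
#if defined(__AVX__)
constexpr std::size_t Width = 8;
#else
constexpr std::size_t Width = 4;
#endif

struct Float
{
    std::array<float, Width> lane{};

    Float() = default;

    // Broadcast a scalar to every lane.
    explicit Float(float scalar) { lane.fill(scalar); }
};

inline Float operator*(const Float &a, const Float &b)
{
    Float r;
    for(std::size_t i = 0; i < Width; i++) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

inline Float operator+(const Float &a, const Float &b)
{
    Float r;
    for(std::size_t i = 0; i < Width; i++) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

}  // namespace simd

// Shader arithmetic written once. Nothing here mentions 4 or 8, so the same
// source serves every supported width.
inline simd::Float mad(const simd::Float &a, const simd::Float &b, const simd::Float &c)
{
    return a * b + c;
}
```

Code that never names the lane count, like mad here, does not have to be duplicated per vector width.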

I hope that helps. Please note that we're still heads-down in the process of deprecating the OpenGL ES implementation by ensuring the ANGLE + SwiftShader Vulkan solution is better on all fronts and gets integrated in Chrome and Android.

Cheers,
Nicolas


podeti vishal

Dec 21, 2020, 7:26:30 AM
to swiftshader
Hi Nicolas, 

Thank you so much for the quick response. We are looking at optimizing the glmark2-es2 benchmark, BaseMarkGPU, Madagascar3d, and AnTuTu-3D.

I agree that if a 4-component vector in the shader is represented by a structure like Vector4f with 4 vectors of 8 elements each, we do not have to use shuffle instructions. I'm confused, though, about how to form a Vector4f whose x, y, z, w contain 8 elements each.

Currently, in the generate function in Shader/SetupRoutine.cpp, v[interpolant][component] is populated using the setUpGradient function, which fills it as follows:

......

In the above code, to read 8 values, would it be correct to do the following, using Float4 i1 and Float4 i2?
    i1.x = *Pointer<Float>(v0 + attribute);
    i1.y = *Pointer<Float>(v1 + attribute);
    i1.z = *Pointer<Float>(v2 + attribute);
    i1.w = 0;

    i2.x = *Pointer<Float>(v0 + attribute + 4);
    i2.y = *Pointer<Float>(v1 + attribute + 4);
    i2.z = *Pointer<Float>(v2 + attribute + 4);
    i2.w = 0;
    <rest of the code>

    *Pointer<Float8>(primitive + planeEquation) = A;  // A will be a Float8 formed from i1 and i2
    *Pointer<Float8>(primitive + planeEquation + 32) = B;
    *Pointer<Float8>(primitive + planeEquation + 64) = C;

Also, in the fetchRegister function in PixelProgram.cpp, there is the case below, where scalar values are duplicated to form a Vector4f.

// This is used for all literal types, and since Reactor doesn't guarantee
// preserving the bit pattern of float constants, we must construct them
// as integer constants and bitcast.

For example, if the programmer writes vec4(a, b, c, d), the x, y, z, w of reg would be aaaa, bbbb, cccc, dddd respectively.
With a Vector4f having 8 elements each for x, y, z, w, we would have aaaaaaaa, bbbbbbbb, cccccccc, dddddddd.
How would this help performance?

And with this change, writeColor would need to write 8 pixels at once to the framebuffer, whereas it currently writes 4. Any pointers on how to do that?

Note: I'm referring to the code in Shader because I was not aware that it will be deprecated. I assume the logic in Shader and Pipeline is similar.

Thanks in advance!

Nicolas Capens

Jan 8, 2021, 12:37:35 AM
to podeti vishal, swiftshader
Hi Podeti,

The SetupRoutine code currently processes one primitive at a time, using Bresenham's algorithm for edge tracing, which isn't very suitable for optimization by wider SIMD. Our plan is to rewrite it to use half-space coverage based rasterization: https://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/readings/olano97_homogeneous.pdf. After that the gradient setup can be rewritten to use arbitrary SIMD width.
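
As a rough illustration of why half-space rasterization suits wide SIMD, the sketch below tests 8 adjacent pixels against three edge functions. It is plain C++ and not SwiftShader code: Edge, makeEdge and coverage8 are hypothetical names, the sign convention is an assumption, and the strict "> 0" test stands in for a proper fill rule, which a real rasterizer needs for pixels lying exactly on an edge.

```cpp
#include <cstdint>

// Edge function E(x, y) = A*x + B*y + C. A point is inside the triangle
// when all three edge functions are positive (assumed sign convention).
struct Edge
{
    float A, B, C;
};

// Edge function for the directed edge (x0, y0) -> (x1, y1).
inline Edge makeEdge(float x0, float y0, float x1, float y1)
{
    return { y0 - y1, x1 - x0, x0 * y1 - x1 * y0 };
}

// Coverage bitmask for 8 horizontally adjacent pixels starting at (x, y),
// sampled at pixel centers. Bit i is set when pixel x + i is inside.
inline std::uint8_t coverage8(const Edge (&e)[3], int x, int y)
{
    std::uint8_t mask = 0;
    const float py = static_cast<float>(y) + 0.5f;

    // Every iteration is independent of the others, so each one maps to a
    // SIMD lane: the three edge tests become three wide multiply-adds.
    for(int i = 0; i < 8; i++)
    {
        const float px = static_cast<float>(x + i) + 0.5f;
        bool inside = true;
        for(const Edge &edge : e)
        {
            inside = inside && (edge.A * px + edge.B * py + edge.C > 0.0f);
        }
        if(inside) mask = static_cast<std::uint8_t>(mask | (1u << i));
    }
    return mask;
}
```

Unlike edge tracing, nothing here depends on the previous pixel or scanline, so the same test extends to any lane count.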

Indeed for something like fetchRegister's handling of FLOAT4LITERAL the values would be broadcast to 8 elements. Note that the number of instructions doesn't reduce compared to the 4-wide code, but you'd be processing 8 pixels in parallel for every loop iteration instead of 4, doubling the effective throughput.
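
The throughput argument can be sketched with a toy loop in plain C++. scaleRow is a hypothetical helper, not SwiftShader code, and its inner loop stands in for a single W-wide multiply instruction:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Multiplies every pixel value by a shader literal, W pixels per iteration.
// Returns the number of outer loop iterations that were executed.
template<std::size_t W>
std::size_t scaleRow(const std::vector<float> &in, std::vector<float> &out, float literal)
{
    assert(in.size() % W == 0 && out.size() == in.size());

    // Broadcast the literal once: aaaa for W = 4, aaaaaaaa for W = 8.
    std::array<float, W> k;
    k.fill(literal);

    std::size_t iterations = 0;
    for(std::size_t p = 0; p < in.size(); p += W)
    {
        // Stands in for one W-wide multiply: same instruction count per
        // iteration regardless of W, but W pixels are finished each time.
        for(std::size_t i = 0; i < W; i++) out[p + i] = in[p + i] * k[i];
        iterations++;
    }
    return iterations;
}
```

For 16 pixels, scaleRow<4> runs 4 iterations and scaleRow<8> runs 2, with identical results. The broadcast itself saves nothing; the gain comes from halving the number of iterations.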

Framebuffer writes have to also be entirely rewritten for the new SIMD width (every transition from structure-of-arrays to arrays-of-structures or vice versa really). Ultimately we plan to not store the framebuffer in linear array-of-structures layout, but with every set of 8 components stored linearly to avoid the transpose on read and write. Then it only needs one conversion on present. SwiftShader historically supported that (we called it quad-layout and there are some remnants of it in the GLES code) but the Vulkan implementation doesn't have any trace of it left.
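
The difference between the two framebuffer layouts can be sketched as follows, again in plain C++ with hypothetical names (ColorWide, storeInterleaved, storeBlocked). The blocked layout shown, 8 reds followed by 8 greens and so on, is only one reading of "every set of 8 components stored linearly" and not necessarily the historical quad-layout:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 8;

// Shader output for 8 pixels in structure-of-arrays form.
struct ColorWide
{
    std::array<float, kLanes> r, g, b, a;
};

// Linear array-of-structures framebuffer: rgba rgba rgba ...
// Each lane is scattered to a different pixel, i.e. a transpose.
inline void storeInterleaved(const ColorWide &c, float *dst)
{
    for(std::size_t i = 0; i < kLanes; i++)
    {
        dst[4 * i + 0] = c.r[i];
        dst[4 * i + 1] = c.g[i];
        dst[4 * i + 2] = c.b[i];
        dst[4 * i + 3] = c.a[i];
    }
}

// Blocked layout (assumed): rrrrrrrr gggggggg bbbbbbbb aaaaaaaa.
// Each component vector is written as one contiguous run, so no transpose.
inline void storeBlocked(const ColorWide &c, float *dst)
{
    std::copy(c.r.begin(), c.r.end(), dst + 0 * kLanes);
    std::copy(c.g.begin(), c.g.end(), dst + 1 * kLanes);
    std::copy(c.b.begin(), c.b.end(), dst + 2 * kLanes);
    std::copy(c.a.begin(), c.a.end(), dst + 3 * kLanes);
}
```

With the blocked form, the only transpose left is the single conversion to a linear image at present time.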

Cheers,
Nicolas

On Wed, Jan 6, 2021 at 12:05 AM podeti vishal <podeti...@gmail.com> wrote:
Hi Nicolas, 

Happy new year!
Could you please let me know whether I'm thinking in the right direction on the query below? Thanks in advance!