Hi Nicolas,
Thank you so much for the quick response. We are looking at optimizing the glmark2-es2 benchmark,
BaseMarkGPU, Madagascar3d, and AnTuTu-3D.
I agree that if the Shader has a 4-component vector with a structure like Vector4f, where each of the 4 components is a vector of 8 elements, we do not have to use
shuffle instructions. I am confused about how we would form a Vector4f whose x, y, z, w each contain 8 elements.
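To check that I understand the layout you are describing, here is a minimal plain C++ sketch (not Reactor code). The names Lanes8, Vector4f8, dot4 and pixel are hypothetical and only stand in for a Float8 register and an 8-wide Vector4f; they are not existing SwiftShader types.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 8;          // hypothetical SIMD width
using Lanes8 = std::array<float, kLanes>;  // plain C++ stand-in for a Float8

// Structure-of-arrays: x holds the x component of 8 different pixels, and so on.
struct Vector4f8
{
    Lanes8 x, y, z, w;
};

// 4-component dot product for 8 pixels at once. Lane i only ever combines
// with lane i of the other operands, so no cross-lane shuffle is needed.
inline Lanes8 dot4(const Vector4f8 &a, const Vector4f8 &b)
{
    Lanes8 r{};
    for(std::size_t i = 0; i < kLanes; i++)
    {
        r[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i] + a.w[i] * b.w[i];
    }
    return r;
}

// Reads back the four components of one pixel (one lane).
inline std::array<float, 4> pixel(const Vector4f8 &v, std::size_t lane)
{
    assert(lane < kLanes);
    return { v.x[lane], v.y[lane], v.z[lane], v.w[lane] };
}
```

If this matches your intent, then the only place data has to be rearranged is where it enters or leaves this layout, which is what the rest of my questions are about.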
Currently in Shader/SetupRoutine.cpp, in the generate function, v[interpolant][component] is populated using the setUpGradient function, which fills it as follows:
......
In the above code, to read 8 values, would it be correct to do the following?
Float4 i1, i2;
i1.x = *Pointer<Float>(v0 + attribute);
i1.y = *Pointer<Float>(v1 + attribute);
i1.z = *Pointer<Float>(v2 + attribute);
i1.w = 0;
i2.x = *Pointer<Float>(v0 + attribute+4);
i2.y = *Pointer<Float>(v1 + attribute+4);
i2.z = *Pointer<Float>(v2 + attribute+4);
i2.w = 0;
<rest of the code>
*Pointer<Float8>(primitive + planeEquation) = A; // A will be a Float8 formed from i1 and i2
*Pointer<Float8>(primitive + planeEquation + 32) = B;
*Pointer<Float8>(primitive + planeEquation + 64) = C;
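To show what I mean by forming a Float8 from two Float4 values and by the +32 and +64 offsets, here is a plain C++ sketch (again not Reactor code). Lanes4, Lanes8, join, store8 and load8 are hypothetical names, and the byte buffer only stands in for the primitive structure. The stride of 32 assumes 8 floats of 4 bytes each.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

using Lanes4 = std::array<float, 4>;  // stand-in for Float4
using Lanes8 = std::array<float, 8>;  // stand-in for a hypothetical Float8

constexpr std::size_t kBytes8 = 8 * sizeof(float);
static_assert(kBytes8 == 32, "this sketch assumes 4-byte floats");

// Concatenates two 4-lane values: lo fills lanes 0..3, hi fills lanes 4..7.
inline Lanes8 join(const Lanes4 &lo, const Lanes4 &hi)
{
    Lanes8 r{};
    for(std::size_t i = 0; i < 4; i++)
    {
        r[i] = lo[i];
        r[i + 4] = hi[i];
    }
    return r;
}

// Stores all 8 lanes at a byte offset, like *Pointer<Float8>(base + offset) = value.
inline void store8(std::vector<unsigned char> &buffer, std::size_t byteOffset, const Lanes8 &value)
{
    assert(byteOffset + kBytes8 <= buffer.size());
    std::memcpy(buffer.data() + byteOffset, value.data(), kBytes8);
}

// Loads 8 lanes back from a byte offset.
inline Lanes8 load8(const std::vector<unsigned char> &buffer, std::size_t byteOffset)
{
    assert(byteOffset + kBytes8 <= buffer.size());
    Lanes8 value{};
    std::memcpy(value.data(), buffer.data() + byteOffset, kBytes8);
    return value;
}
```

With this layout, three consecutive 8-lane values occupy 96 bytes, so any structure that currently reserves 16 bytes per Float4 would need its field sizes and offsets doubled.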
Also, in the fetchRegister function in PixelProgram.cpp, we have the case below, where scalar values are duplicated to form a Vector4f.
// This is used for all literal types, and since Reactor doesn't guarantee
// preserving the bit pattern of float constants, we must construct them
// as integer constants and bitcast.
For example, if the programmer writes vec4(a, b, c, d), the x, y, z, w of reg would be aaaa, bbbb, cccc, dddd respectively.
With a Vector4f having 8 elements each for x, y, z, w, we would have aaaaaaaa, bbbbbbbb, cccccccc, dddddddd.
How would this help performance?
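For reference, this is how I picture the 8-wide version of that literal path, as a plain C++ sketch rather than Reactor code. The names bitsToFloat and broadcast8 are hypothetical; the memcpy plays the role of the bitcast mentioned in the comment above.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

using Lanes8 = std::array<float, 8>;  // stand-in for a hypothetical Float8

// Reinterprets a 32-bit integer pattern as a float without changing any bits.
inline float bitsToFloat(std::uint32_t bits)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "this sketch assumes 32-bit floats");
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Replicates one scalar constant into all 8 lanes, e.g. a -> aaaaaaaa.
inline Lanes8 broadcast8(std::uint32_t bits)
{
    Lanes8 r;
    r.fill(bitsToFloat(bits));
    return r;
}
```

As far as I can tell, the replication itself costs the same one broadcast per component as today, so my assumption is that any gain would have to come from each subsequent instruction covering 8 pixels instead of 4.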
And with this change, writeColor should write 8 pixels at once to the framebuffer, whereas currently it writes 4 pixels. Any pointers on how to do that?
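To make the question concrete, here is the kind of masked 8-pixel write I have in mind, as a plain C++ sketch. It assumes, purely for illustration, that the 8 lanes map to 8 adjacent 32-bit pixels in one row; I do not know whether that is the right lane-to-pixel mapping for the real writeColor. The name writeColor8 and its parameters are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLanes = 8;  // hypothetical SIMD width

// Writes up to 8 adjacent packed 32-bit pixels starting at column x.
// Bit i of coverageMask says whether lane i is covered and may be written.
inline void writeColor8(std::vector<std::uint32_t> &row,
                        std::size_t x,
                        const std::array<std::uint32_t, kLanes> &color,
                        std::uint8_t coverageMask)
{
    assert(x + kLanes <= row.size());
    for(std::size_t i = 0; i < kLanes; i++)
    {
        if(coverageMask & (1u << i))
        {
            row[x + i] = color[i];  // uncovered lanes keep the old framebuffer value
        }
    }
}
```

On real hardware I would expect the loop to become a single masked or blended 256-bit store, but that is my guess and not something I have verified.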
Note: I'm referring to the code in Shader because I was not aware that it will be deprecated. I assume the logic in Shader and Pipeline is similar.
Thanks in advance!