Shader path a lot slower

Anders Backman

Apr 8, 2021, 3:42:28 AM

to OpenSceneGraph Users

Hi all.

Have you all noticed that the Draw time is dramatically higher for a shader-based scene than for fixed functionality?

I am comparing the most basic possible shader against the fixed-function rendering pipeline. The difference in Draw time is:

1.3 ms (fixed) compared to 3.74 ms (shader) in a scene with 300 boxes: 7776 vertices, 3886 triangles.

RTX2080. Windows 10. OSG 3.6.4.

It feels kind of bad to have such a large overhead right from the start.

For larger scenes the difference shrinks, but it is still very noticeable.

With a stupid scene with 10000 drawables (the same boxes), the difference between an "empty" shader and the fixed pipeline is < 10%.

But with a diffuse/specular shader, the difference jumps to 100% compared to the fixed pipeline.

What is your experience with this?

draw.vert:

#version 440

layout (location = 0) in vec4 osg_Vertex;
layout (location = 1) in vec3 osg_Normal;

uniform mat4 osg_ModelViewProjectionMatrix;

void main()
{
    gl_Position = osg_ModelViewProjectionMatrix * osg_Vertex;
}

draw.frag:

#version 440

layout (location = 0) out vec4 outColor;

void main (void)
{
    outColor = vec4(1,0.4,1,1);
}

Robert Osfield

Apr 8, 2021, 4:05:28 AM

to OpenSceneGraph Users

Hi Anders,

On Thu, 8 Apr 2021 at 08:42, Anders Backman <backm...@gmail.com> wrote:

> Have you all noticed that the time for Draw is dramatically more expensive for a shader based scene compared to fixed functionality?

It can be faster or slower; it depends on the nature of your scene graph and how you manage state. If you have lots of uniforms and lots of different shaders then this could introduce a high CPU overhead.

In general, for a shader-based subgraph the CPU overhead tends to be higher. However, shaders give you lots of opportunities to batch data, use things like instancing, etc., so you can end up with a lower CPU overhead overall.

> I have the most basic shader possibility comparing to a fixed functionality rendering pipeline. The difference in Draw call is:
>
> 1.3 ms (fixed) compared to 3.74ms (shader) in a scene with 300 boxes. 7776 vertices, 3886 triangles.
>
> RTX2080. Windows 10. OSG 3.6.4.
>
> It feels kind of bad to start with a huge overhead already from the start.
> For larger scenes the difference shrinks. Still very much noticeable.
> With a stupid scene with 10000 drawables (same boxes) difference between "empty" shader is < 10%.

The OSG/OpenGL combination doesn't handle fine-grained scene graphs at all well - hence my batching comment above. Could you try instancing to reduce the CPU overhead?

Or... try the VulkanSceneGraph/Vulkan. It sounds like a simple enough scene graph to replicate quite easily in the VSG, and it may well be an order of magnitude faster for this type of use case, even without using techniques like instancing to batch the data.

Cheers,

Robert.

Anders Backman

Apr 9, 2021, 2:02:16 AM

to OpenSceneGraph Users

It looks like for each MatrixTransform you add to the scene, you get TWO calls to glUniformMatrix4fv: one for the osg_ModelViewMatrix and one for the osg_ModelViewProjectionMatrix.

With the shader path I have 1500 GL calls during one frame, compared to only 450 with the fixed pipeline.

That should account for some of the decrease in performance.

Other than that, profiling does not show anything specifically CPU-related when running the application.

It is just that the draw can be 2-4 times slower (perhaps due to the number of GL calls).

But the GPU time is also higher, by about 100% for the same complexity (and a really simple shader).

/A

Robert Osfield

Apr 9, 2021, 3:13:47 AM

to OpenSceneGraph Users

Hi Anders,

On Fri, 9 Apr 2021 at 07:02, Anders Backman <backm...@gmail.com> wrote:

> It looks like for each MatrixTransform you add to the scene, you get TWO calls to glUniformMatrix4fv, one for the osg_ModelViewMatrix and one for the osg_ModelViewProjectionMatrix
> Compared to the fixed pipeline I have 1500 gl calls during one frame, and in fixed only 450.
> That should make for some of the decrease in performance.

Interesting stats. I guess if we added more complexity to the OSG state tracking, one could figure out whether both osg_ModelViewMatrix and osg_ModelViewProjectionMatrix are required, but this would need quite a few changes to the core OSG to juggle. It would also increase the OSG CPU overhead, so it is not a free addition even if it were a net gain for some applications.
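To make the trade-off concrete, here is a minimal sketch of the kind of check described above: count the matrix uploads per frame with and without skipping uniforms that the active program does not declare. This is not how OSG tracks state; ProgramInfo and countMatrixUploads are hypothetical names invented for illustration, and the model assumes one upload per tracked uniform per transform.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a linked shader program: only the set of
// uniform names the program actually declares. Not an OSG class.
struct ProgramInfo {
    std::set<std::string> activeUniforms;
};

// Counts the matrix uniform uploads (glUniformMatrix4fv-style calls) needed
// in one frame for `numTransforms` transforms.
// skipUnused == false: every tracked matrix uniform is uploaded per transform.
// skipUnused == true:  only uniforms declared by the program are uploaded.
std::size_t countMatrixUploads(const ProgramInfo& program,
                               const std::vector<std::string>& trackedMatrixUniforms,
                               std::size_t numTransforms,
                               bool skipUnused)
{
    std::size_t perTransform = 0;
    for (const std::string& name : trackedMatrixUniforms) {
        // The lookup below is the extra per-uniform work the tracker must do.
        if (!skipUnused || program.activeUniforms.count(name) > 0) {
            ++perTransform;
        }
    }
    return perTransform * numTransforms;
}
```

For the draw.vert from the first post, which only declares osg_ModelViewProjectionMatrix, and assuming one MatrixTransform per box (300 in total), this model gives 600 uploads without the check and 300 with it. The set lookup is where the additional CPU cost mentioned above would come from.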

> Other than that a profiling does not give anything specific CPU related when running the application.
> It is just that the draw call can be 2-4 times slower (perhaps due to the number of gl calls).

The CPU bottleneck is an ever-present problem with OpenGL, and it has got worse with the use of shaders. There's a reason why Vulkan was created and why I started to work on the VulkanSceneGraph.

> But the GPU time is also higher, about 100% with the same complexity (and a really simple shader).

The GPU is only as fast as the pipe that feeds it, so if the CPU side is bogged down the GPU will stall and take more time. Changing state on the GPU also has a significant impact on performance.

Batching geometry and state helps with CPU and GPU overheads but can't fix it completely. The biggest performance gain with using shaders is that you can start batching some scenes much more aggressively using techniques like instancing. In your scene this is probably the way to go.
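As a rough illustration of the CPU-side half of instancing, the sketch below packs the per-box matrices into one contiguous array so that they can be uploaded once instead of through one uniform call per box. It is plain C++ with no OSG or OpenGL calls; Mat4, makeTranslation and packInstanceMatrices are hypothetical helpers, and how the buffer reaches the shader (a per-instance vertex attribute, or a buffer indexed with gl_InstanceID) is left open.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Mat4 = std::array<float, 16>;  // column-major 4x4 matrix

// Builds a column-major translation matrix (translation in elements 12..14).
Mat4 makeTranslation(float x, float y, float z)
{
    return Mat4{1.0f, 0.0f, 0.0f, 0.0f,
                0.0f, 1.0f, 0.0f, 0.0f,
                0.0f, 0.0f, 1.0f, 0.0f,
                x,    y,    z,    1.0f};
}

// Packs all per-box matrices into one contiguous float array, so the whole
// set can be handed to the GPU in a single upload and drawn with a single
// instanced draw call rather than one draw per box.
std::vector<float> packInstanceMatrices(const std::vector<Mat4>& matrices)
{
    std::vector<float> packed;
    packed.reserve(matrices.size() * 16);
    for (const Mat4& m : matrices) {
        packed.insert(packed.end(), m.begin(), m.end());
    }
    return packed;
}
```

For the 300 boxes this yields 4800 floats (19200 bytes) in one buffer. On the OSG side the instance count is set on the primitive set with setNumInstances(), and the vertex shader then has to pick the matrix for each instance itself.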

Or... just try the VulkanSceneGraph. If you just have a bunch of small geometries that you are controlling with matrices set on the CPU then the VSG will blow the OSG out of the water. Both Vulkan and the VSG are very well optimized for this type of load. Vulkan has "push constants", a very lightweight way to pass regularly changing values, with significantly lower overhead than using uniforms to do the same. The VSG uses push constants to send modelview matrices to the GPU.
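To give a feel for how small this payload is, here is a minimal sketch of a push constant block holding two 4x4 float matrices. The struct name and its layout (projection plus modelview) are assumptions for illustration, not taken from the VSG sources. The 128 byte figure is the minimum value of maxPushConstantsSize that the Vulkan specification requires every implementation to support.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical push constant block: two column-major 4x4 float matrices.
// The layout is an assumption for illustration only.
struct MatrixPushConstants {
    std::array<float, 16> projection;
    std::array<float, 16> modelView;
};

// Minimum maxPushConstantsSize guaranteed by the Vulkan specification.
constexpr std::size_t kGuaranteedPushConstantBytes = 128;

// Two mat4s are 2 * 16 * 4 = 128 bytes, exactly the guaranteed range.
static_assert(sizeof(MatrixPushConstants) == 128,
              "two 4x4 float matrices should occupy 128 bytes");

// Returns true if a payload of `bytes` fits in the range every Vulkan
// implementation must support, so it could be passed with a single
// vkCmdPushConstants call without querying device limits first.
bool fitsInGuaranteedPushConstantRange(std::size_t bytes)
{
    return bytes <= kGuaranteedPushConstantBytes;
}
```

The values are recorded directly into the command buffer, so no uniform buffer or descriptor update is needed per object. A third matrix would no longer fit in the guaranteed range, which is one reason to keep such a block this small.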

The VSG helps by making culling an explicit task - you place CullGroups above any subgraphs that you want to enable view frustum culling for, rather than it being enabled for all nodes all the time unless explicitly disabled. This dramatically cuts the number of conditionals during traversal as well as focusing the culling task on just the nodes where it's known to be important. The VSG also allows you to say that a subgraph below a MatrixTransform doesn't require any culling, so the view frustum doesn't need to be transformed into the local coordinate frame - this is another important optimization that lowers the CPU overhead.

A final important part of the performance puzzle is that Vulkan has all the command and data preparation done in the application user thread, and then passed as a block (command buffer) with a single submission call. You can prepare multiple command buffers in parallel and submit them together.

I could go on... Vulkan and the VSG have lots of tricks that radically change how much performance you can get out of the whole CPU/GPU system.

I guess I need to write a short VSG vs OSG example that illustrates this type of task. It's an example of a worst-case scenario for OpenGL/OSG which won't make Vulkan/VSG even break a sweat.