Performance vs. llvmpipe?

Glenn Watson

Jul 3, 2019, 4:49:38 AM
to swiftshader
Hi,

I've been experimenting with different software rasterizers. I was hoping to see comparable performance between SwiftShader and llvmpipe/mesa, but so far llvmpipe seems much faster.

I'm wondering if this is something specific to the application, or something wrong with how I'm building SwiftShader, or if there is some other simple explanation?

I'm testing on a Linux machine with an i7-4790 CPU @ 3.6 GHz. With llvmpipe I see ~46 FPS, while the SwiftShader backend runs at ~5 FPS. I used the default LLVM backend for SwiftShader, but building with the Subzero backend gives similar performance.

I have listed steps below that should reproduce the benchmark, if you have time to investigate. Alternatively, could you comment on whether this result is unexpected, or share any ideas about what might be wrong in the benchmark?

Thanks

---

1) Ensure rustc is installed via https://rustup.rs (default stable compiler options are fine).

2) Clone WebRender:
    git clone https://github.com/servo/webrender/
    cd webrender/wrench

3) Run the test case with mesa + llvmpipe:
    ./script/wrench_with_renderer.py llvmpipe benchmarks/text-rendering.yaml

4) Run the test case with SwiftShader:
    Clone my local SwiftShader fork from https://github.com/gw3583/swiftshader
        - This has one commit which has a couple of local hacks I needed.
        - Or just apply https://github.com/gw3583/swiftshader/commit/ccaf35ec7f1b7794846ed970b23cf47b28f98d50 on your master.

    Build SwiftShader

    cd swiftshader/build
    ln -s libEGL.so libEGL.so.1
    export LD_LIBRARY_PATH=<path to swiftshader build/ dir>

    ./script/wrench_with_renderer.py swiftshader benchmarks/text-rendering.yaml

To compare the results:
    - While wrench is running, press 'p' to see the on-screen profiler with FPS counter.
    - To confirm which rasterizer is in use, the GL vendor string is written to the terminal and shown in the window title.
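
If you want to verify the active rasterizer programmatically rather than from the wrench output, a minimal sketch (assuming a current GLES2 context; this is not part of wrench):

    // Hypothetical stand-alone check: prints the same strings wrench shows
    // in the terminal and window title.
    #include <GLES2/gl2.h>
    #include <cstdio>

    void printRasterizerInfo()
    {
        printf("GL_VENDOR:   %s\n", reinterpret_cast<const char *>(glGetString(GL_VENDOR)));
        printf("GL_RENDERER: %s\n", reinterpret_cast<const char *>(glGetString(GL_RENDERER)));
        printf("GL_VERSION:  %s\n", reinterpret_cast<const char *>(glGetString(GL_VERSION)));
    }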

Nicolas Capens

Jul 4, 2019, 12:07:43 AM
to Glenn Watson, swiftshader
Hi Glenn,

Thanks for reporting this and providing detailed repro steps!

We've found SwiftShader to be ~2x faster than llvmpipe in the past. So seeing it be this much slower is quite unexpected. I suspect a fallback code path is taken for an uncommon operation.

We're currently heads-down implementing Vulkan support, but when I have some spare time I'll take a closer look. Please note that it's unlikely we'll put effort into optimizing the OpenGL ES implementation at this point. But we're working towards using ANGLE on top of the Vulkan implementation eventually, and we'll definitely want that to be fast!

Cheers,
Nicolas

Jeff Muizelaar

Jul 4, 2019, 9:30:23 AM
to Nicolas Capens, Glenn Watson, swiftshader
On Thu, Jul 4, 2019 at 12:07 AM 'Nicolas Capens' via swiftshader
<swift...@googlegroups.com> wrote:
>
> Hi Glenn,
>
> Thanks for reporting this and providing detailed repro steps!
>
> We've found SwiftShader to be ~2x faster than llvmpipe in the past. So seeing it be this much slower is quite unexpected. I suspect a fallback code path is taken for an uncommon operation.

Do you have any suggestions for how we could investigate this further?
For example, is it easy to tell if a fallback path is being hit? Do
you have other recommendations or tips for investigating SwiftShader
performance?

-Jeff

Nicolas Capens

Jul 4, 2019, 4:46:51 PM
to Jeff Muizelaar, Glenn Watson, swiftshader, Ben Clayton

Hi Jeff!

I've had a quick look, and it doesn't look like it's hitting a slow path. By slow path I mean fallback C++ code instead of JIT-compiled code optimized for the draw command's state. perf shows that 65% of the time is spent in generated code (shown as [unknown]), which is quite typical, with no substantial hotspots elsewhere.

Strangely, I'm getting ~7 FPS when forcing it to run single-threaded, and performance gradually decreases as more threads are added. We're aware of sub-optimal multi-core scaling, typically plateauing at around 8-16 threads and then slowing down due to contention, but I've never seen it not benefit from 2 threads. Also, if the wrench dashboard is correct, there's only one draw call, so no contention from that, and plenty of vertices and pixels to process in parallel.

Ben is a new team member with an interest in optimizing our Vulkan implementation, and since it inherited some core task scheduling from the OpenGL ES implementation, he might take a closer look at this case at some point.

Cheers,
Nicolas

Jeff Muizelaar

Jul 8, 2019, 8:29:20 PM
to Nicolas Capens, Glenn Watson, swiftshader, Ben Clayton
I tried digging into this further by enabling PERF_PROFILE. However,
doing so broke the build, with errors about:
'AddAtomic(Pointer<Long>(&profiler.ropOperations), 4);'

Is PERF_PROFILE expected to still work?

-Jeff

Jeff Muizelaar

Jul 9, 2019, 9:43:01 PM
to Nicolas Capens, Glenn Watson, swiftshader, Ben Clayton
I was able to get some numbers out, and after some digging it looks
like the problem might be that WebRender is using instancing with an
instance count of 4944. It looks like SwiftShader lowers this to an
individual draw call per instance, which presumably limits the amount
of parallelism that SwiftShader is able to take advantage of.
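
For reference, the lowering I'm describing would look roughly like the
sketch below (the names are illustrative, not SwiftShader's actual
internals):

    struct Context
    {
        void setInstanceID(int id);   // re-applies per-instance state
        void drawElements(int count); // schedules one draw task
    };

    // Hypothetical lowering of an instanced draw into one submission per
    // instance; with 4944 instances of a couple of triangles each, that is
    // thousands of tiny tasks per frame.
    void drawElementsInstanced(Context *ctx, int count, int instanceCount)
    {
        for(int instance = 0; instance < instanceCount; instance++)
        {
            ctx->setInstanceID(instance);
            ctx->drawElements(count);
        }
    }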

Does this explanation seem plausible?

-Jeff

Nicolas Capens

Jul 9, 2019, 11:07:23 PM
to Jeff Muizelaar, Glenn Watson, swiftshader, Ben Clayton
Hi Jeff,

Yes, that explains it! Ben had noticed a large number of calls to our task scheduler, despite the single draw call from the app. Indeed we split instances into separate tasks, much like multiple draw calls with reuse of the same context state. It works well with instances consisting of many triangles, but not when they're only a couple of triangles.

We should be able to vastly improve this, either by implementing the attribute divisors using real divisions (not cheap but still better than causing scheduler contention), or by counting the vertices during primitive assembly and rolling over when reaching the divisor (a modulo operation would still be needed per batch of primitives). Sorry for the details, I want to write down my thoughts for when we start working on this. :-) We're getting close to wrapping up our work on Vulkan...
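
To make the second option a bit more concrete, a rough sketch of the rolling counter (illustrative only, not actual SwiftShader code):

    // Hypothetical: advance the instanced-attribute fetch index with a counter
    // that rolls over at the divisor, instead of dividing in the hot loop.
    // Assumes instances are processed in increasing order.
    struct DivisorCounter
    {
        int divisor;
        int count = 0;
        int attribIndex = 0;

        void next()
        {
            if(++count == divisor)
            {
                count = 0;
                attribIndex++;
            }
        }
    };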

Cheers,
Nicolas

Jeff Muizelaar

Jul 10, 2019, 3:46:37 PM
to Nicolas Capens, Glenn Watson, swiftshader, Ben Clayton
On Tue, Jul 9, 2019 at 11:07 PM 'Nicolas Capens' via swiftshader
<swift...@googlegroups.com> wrote:
>
> Hi Jeff,
>
> Yes, that explains it! Ben had noticed a large number of calls to our task scheduler, despite the single draw call from the app. Indeed we split instances into separate tasks, much like multiple draw calls with reuse of the same context state. It works well with instances consisting of many triangles, but not when they're only a couple of triangles.
>
> We should be able to vastly improve this, either by implementing the attribute divisors using real divisions (not cheap but still better than causing scheduler contention), or by counting the vertices during primitive assembly and rolling over when reaching the divisor (a modulo operation would still be needed per batch of primitives). Sorry for the details, I want to write down my thoughts for when we start working on this. :-) We're getting close to wrapping up our work on Vulkan...

It looks like llvmpipe and swr both just do divisions, and perhaps
hope the divisor is a constant that LLVM will optimize into a
multiplication.
https://github.com/mesa3d/mesa/blob/bb14abed18638c85b7892f435b9ac26d5b62edd4/src/gallium/auxiliary/draw/draw_llvm.c#L1766
https://github.com/mesa3d/mesa/blob/187a6506a3e39ab613a9085fe01b23fb42f9aa6b/src/gallium/drivers/swr/rasterizer/jitter/fetch_jit.cpp#L594

I don't know whether Subzero optimizes constant divisions into
multiplications, but if it doesn't, even just special-casing a divisor
of '1' would probably cover a lot of situations. (That's all that
WebRender uses.)
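
Something along these lines is what I have in mind (a hedged sketch,
not a patch against SwiftShader):

    // Hypothetical: special-case a divisor of 1 and fall back to the generic
    // division otherwise.
    int attribIndex(int instanceID, int divisor)
    {
        if(divisor == 1)
        {
            return instanceID;   // the only case WebRender hits
        }
        return instanceID / divisor;
    }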

-Jeff

Jeff Muizelaar

Jul 13, 2019, 2:33:17 PM
to Nicolas Capens, Glenn Watson, swiftshader, Ben Clayton
I took a closer look at what was going on, and it seems like there's
probably more to this than just SwiftShader's handling of instancing.
If I disable instancing by using the -nobatch argument, llvmpipe's
performance drops a little, but it's still much higher than
SwiftShader's. I also tested OpenSWR on this test case, and its
performance seems to fall roughly midway between llvmpipe and
SwiftShader.

-Jeff

Ben Clayton

Jul 15, 2019, 8:07:12 AM
to swiftshader
Hi Jeff,

I've been looking into this, and I strongly suspect we're hitting contention in the Renderer's thread scheduling, resulting in poor performance.
I added some code to emit a trace file, which can be loaded with Chrome's chrome://tracing viewer. Your test application produces something that looks like this:

[Attachment: Screenshot from 2019-07-15 12-26-47.png]

I'm still investigating why this particular test causes such poor performance compared to others.

I'm currently in the middle of rearchitecting the task scheduling for our Vulkan implementation of Renderer. I'll experiment to see whether any of this work can be back-ported.
At the very least, this should either confirm or rule out scheduling contention as the cause of the performance issues.

I'll report back my findings as soon as I can.

Cheers,
Ben

Ben Clayton

Jul 16, 2019, 7:42:28 AM
to swiftshader
An update on my findings.

I've attempted to optimize the job scheduling, which has removed any visible thread contention. With this, I've gone from 2.75 FPS to around 6 FPS on my machine. Still some way to go!

I believe there are two major performance bottlenecks at play here:

1) I think the previous assessment that our instancing implementation is hurting us is correct. Even with my optimized version, each draw call submits in about 33µs, and roughly 29µs of that is spent applying/updating state for the individual draw instances. If we pushed the instancing further down into Renderer as a separate submission, I think this would alleviate most of the performance overhead. Some moderate and easy wins might be found by being smarter about state updates for instances of the same submission.

2) Once the vertex processing is done, rasterization is parallelized across threads. In this particular test each draw has an exceptionally small area (number of pixels) to rasterize. Doing the full rasterization of the most common draw on a single thread takes roughly 6µs. By default, the renderer will attempt to parallelize rasterization across all the logical cores available. The overhead of distributing such a tiny amount of work across all the logical cores significantly slows things down in this case. It might be worth investigating whether we can scale the number of threads used to rasterize a draw by the number of pixels.
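
A rough sketch of what I mean by scaling the thread count with pixel area (the constant and names are illustrative, not actual Renderer code):

    #include <algorithm>

    // Hypothetical: cap the number of rasterization threads for a draw by its
    // pixel area, so tiny draws stay on one thread instead of being spread
    // across every logical core.
    int rasterThreadsForDraw(int pixelArea, int availableThreads)
    {
        const int pixelsPerThread = 4096;   // assumed tuning constant
        int wanted = (pixelArea + pixelsPerThread - 1) / pixelsPerThread;
        return std::min(std::max(wanted, 1), availableThreads);
    }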

Here is a screenshot from a new trace, with the new scheduler and rasterization limited to a single thread:

[Attachment: Screenshot from 2019-07-16 11-55-41.png]

Some context on these blocks:

- [update state] - time spent in this scope.
- Between [update state] and [scheduleDraw] - time spent copying state into the DrawCall / DrawData.
- [scheduleDraw] - time spent enqueuing the work for vertex and fragment processing.
- [drawElements apply] - time spent in these two functions.
- [Device::bindShaderConstants] - time spent in this function.

The full trace file can be found here.

It is interesting to hear that llvmpipe still outperforms SwiftShader when using the -nobatch argument. Perhaps llvmpipe has more efficient draw-call updates?

Cheers,
Ben

Jeff Muizelaar

Jul 16, 2019, 9:44:45 PM
to Ben Clayton, swiftshader
On Tue, Jul 16, 2019 at 7:42 AM 'Ben Clayton' via swiftshader
<swift...@googlegroups.com> wrote:
>
> An update on my findings.
>
> I've attempted to optimize the job scheduling, which has removed any visible thread contention. With this, I've gone from 2.75 FPS to around 6 FPS on my machine. Still some way to go!
>
> I believe there are two major performance bottlenecks at play here:
>
> 1) I think the previous assessment that our instancing implementation is hurting us is correct. Even with my optimized version, each draw call submits in about 33µs, and roughly 29µs of that is spent applying/updating state for the individual draw instances. If we pushed the instancing further down into Renderer as a separate submission, I think this would alleviate most of the performance overhead. Some moderate and easy wins might be found by being smarter about state updates for instances of the same submission.
>
> 2) Once the vertex processing is done, rasterization is parallelized across threads. In this particular test each draw has an exceptionally small area (number of pixels) to rasterize. Doing the full rasterization of the most common draw on a single thread takes roughly 6µs. By default, the renderer will attempt to parallelize rasterization across all the logical cores available. The overhead of distributing such a tiny amount of work across all the logical cores significantly slows things down in this case. It might be worth investigating whether we can scale the number of threads used to rasterize a draw by the number of pixels.

Would pushing instancing down all the way to rasterization help here
too? Instead of having one draw call per instance, each with a small
amount of work, we should be able to have a whole set of primitives
so that each rasterization thread is drawing all of them.

> Here is a screenshot from a new trace, with the new scheduler and rasterization limited to a single thread:
>
> The full trace file can be found here.

Is the code to emit the trace file available somewhere?

>
> It is interesting to hear that llvmpipe still out-performs SwiftShader when using the -nobatch argument. Perhaps llvmpipe has more efficient draw-call updates?

I think one of the reasons that llvmpipe maintains good performance
with -nobatch is that it is a tile-based deferred rasterizer. I
suspect this means that during pixel shading there's not really any
difference between instanced rendering and -nobatch.
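
By "tiling deferred" I mean something along these lines (a conceptual
sketch, not llvmpipe's actual code): primitives are binned into screen
tiles at submit time, and the per-pixel work happens once per tile
afterwards, so the cost of many tiny draws is largely amortized.

    #include <vector>

    struct Tile { std::vector<int> primitives; };   // indices into a frame-wide list

    // Hypothetical binning step: record which tiles a primitive's bounding box
    // touches. Shading later walks each tile's list on one thread, regardless
    // of how many draw calls the primitives came from.
    void binPrimitive(std::vector<Tile> &tiles, int tilesPerRow, int tileSize,
                      int primitiveIndex, int minX, int minY, int maxX, int maxY)
    {
        for(int ty = minY / tileSize; ty <= maxY / tileSize; ty++)
        {
            for(int tx = minX / tileSize; tx <= maxX / tileSize; tx++)
            {
                tiles[ty * tilesPerRow + tx].primitives.push_back(primitiveIndex);
            }
        }
    }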

-Jeff

Nicolas Capens

Jul 17, 2019, 9:38:55 AM
to Jeff Muizelaar, Ben Clayton, swiftshader
I've filed a tracker bug for optimizing instancing: b/137740918

Indeed, I think we should concentrate on addressing the underlying issue of splitting the draw call. It's interesting that llvmpipe can handle many individual tiny draw calls faster too, but that's an entirely different issue, and it would go away for this use case if we keep the batched draw call whole.

Ben Clayton

Jul 18, 2019, 6:26:47 AM
to swiftshader
> Is the code to emit the trace file available somewhere? 

This currently resides in a private branch that is still work in progress. However, the trace file generation is moderately stable, and consists of two files: Trace.hpp and Trace.cpp.
You should be able to grab them and use the macros defined at the bottom of Trace.cpp. I'll also try to get this up for review soon.
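
In the meantime, the file format itself is simple. A minimal, hypothetical sketch of the kind of event it contains (this is just the generic Chrome Trace Event Format that chrome://tracing loads, not the actual Trace.hpp / Trace.cpp):

    #include <cstdio>

    // One "complete" event ("ph":"X"): a name, a start timestamp and a duration
    // in microseconds, plus process and thread ids. Events live in a JSON array;
    // chrome://tracing tolerates a trailing comma and a missing closing ']'.
    void emitTraceEvent(FILE *out, const char *name,
                        long long startUs, long long durationUs, int tid)
    {
        fprintf(out,
                "{\"name\":\"%s\",\"ph\":\"X\",\"ts\":%lld,\"dur\":%lld,"
                "\"pid\":1,\"tid\":%d},\n",
                name, startUs, durationUs, tid);
    }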

Cheers,
Ben
