OSL and the GPU


Erich Ocean

Oct 3, 2010, 1:57:24 PM10/3/10
to osl...@googlegroups.com
With the LLVM shader now stable, I've been working on how to accelerate OSL with the GPU, specifically Fermi-class GPUs from nVidia. I thought I'd share with the list my current thoughts, and I'd appreciate any feedback or suggestions.

Background
==========

Open Shading Language is unique among shading languages in that each shader invocation has two distinct stages. In the first stage, the shader "as written" is executed once for a given set of global variables. Shader evaluation can include texture lookups and procedural code, and of course the shader has access to the global variables provided by the renderer. The result of the first stage is a "radiance closure", or simply "closure".

There are 28 pre-built closures in OSL today, and it's easy to add new ones with a little C++. A closure includes all shading results EXCEPT for the interaction with lights; that is handled in the second stage.

(In reality, the closure result from the first stage is not a single closure, but rather a list of weighted closures. Each weighted closure has a label -- some closures contribute to diffuse, some to glossy, etc. By summing their outputs, you compute the final color.)

In the second stage, the closure list is integrated by the renderer with the lights in the scene to produce actual RGB values. The renderer is free to evaluate or "sample" the closure list from different viewing directions, lights, etc. The individual closures also provide helper methods the renderer can use to discover "good" directions to sample from, making sampling more efficient.

(OSL also provides "light path expressions" to help manage the output of the second stage, but these are beyond the scope of this email.)

Each closure list can be evaluated hundreds or thousands of times without re-running the shader code from the first stage. Closures themselves are MUCH simpler to evaluate than stage-one shaders: they do not access textures, have no procedural code, and do not have access to the global variables from the renderer. All of these values and computations are pre-baked into the closure's "instance data".

These characteristics lead me to believe that evaluating closures on the GPU should be the first target for GPU acceleration. The first stage shaders should continue to be run on the CPU.

Evaluation Strategy
===================

How a closure is evaluated is renderer-dependent. That said, there are typically four steps:

For each closure list in the scene:
1. Choose a single closure from the list.
2. Compute an interesting direction to sample the closure.
3. Sample the closure.
4. Integrate the result into the final image.

Each of these steps could be performed on the GPU, but steps 2 and 3 make the most sense initially. (Step 4 could also be done efficiently on the GPU, but this would require more changes to the renderer.)

Broadly speaking, with CUDA (nVidia's language/API for general-purpose computing on the GPU), you begin by copying data from host memory to device memory, then issue a CUDA "kernel" to run (passing in pointers to device memory, and other arguments), and then wait for the kernel to finish computing. When the kernel completes, you copy the results back from device memory to host memory, where you can operate on them further from normal CPU code.

To accelerate the above with CUDA, I propose the following overall approach:

Setup (done once):
------------------

First, each closure in a closure list is associated with an index into an array containing instance data for closures of that type. There are 28 such arrays (one for each closure type). A weight (if any) that applies to a closure is placed next to the closure's instance data in memory. A closure's index into the array is remembered on the CPU side. Indices must fit in 24 bits, so the maximum number of closures of a given type is limited to ~16.8 million.

Second, the above arrays are copied from host memory to device memory, where they will persist for many kernel executions.

Third, the renderer creates three "pinned" memory regions -- arrays of bytes that are shared between the host and device. The arrays are called the "input" array, the "args" array, and the "output" array. All three arrays contain n elements, where "n" is the number of samples the renderer wants to execute in a single batch. (This value is limited by the memory available on the GPU.) The size of the input array is 8n bytes, the args array is ???n bytes, and the output array is ???n bytes.

Batch shading (for as long as the renderer wishes to shade):
------------------------------------------------------------

The renderer loops over each closure list in the scene and does step 1 above ("Choose a single closure from the list."). The result is appended to the input array, with the first byte representing the closure type (an integer from 0-27), the next three bytes indicating the offset of that closure's instance data in the array from Setup step one, and the final four bytes holding a pointer to the place in the args array where the arguments will be placed. The renderer then places any arguments needed for steps 2 and 3 in the args array (the layout of these arguments is known by the closure evaluation code).

Next, the renderer sorts the input array using Duane Merrill's high-performance key-value radix sort. [1] (The radix sort sorts consistently at over 1 billion keys/sec on a GTX 480.) The result is that the input array is sorted first by closure type, then by instance data index within that closure type. Both are critical to getting high thread occupancy in warps, and coalescing memory loads and stores as much as possible. (Data is written into the output array in *sorted* order.)

With the input sorted and ready for optimal execution, an uber-kernel that evaluates steps 2 and 3 in one pass is issued to the device. The uber-kernel is structured like a switch statement and takes advantage of the program stack in Fermi devices. Here's roughly what it looks like:

// compute the data below based on threadIdx, input, args pointers, etc.
int closure_type;
void *closure_args;
void *closure_data;
void *closure_output;

// run the correct closure code
switch (closure_type) {
  case DIFFUSE: {
    Vec3 foo, bar; // as much data as we need for the given closure type

    diffuse_choose (closure_data, closure_args, &foo, &bar);
    diffuse_sample (closure_data, closure_args, closure_output, &foo, &bar);
    break;
  }
  case MICROFACET_GGX:
    /* similar */
    break;
}

This uber-kernel is computationally efficient because, with the input sorted, all threads in most warps execute the identical branch of the switch statement. Only those warps whose threads span two (or more) closure types will diverge; this is guaranteed to happen at most 27 times across all warps in the batch (once per boundary between the 28 closure types).

When the uber-kernel returns, all of the closures will have placed their results in output. The results appear in the *sorted* order of the input, so the renderer must loop over the input and index the output array accordingly. (The input array is key-value for this reason -- the value in the input array can be used to determine which closure the results at the iteration index apply to.)

Once the renderer has finished processing the output array, the input and args arrays can be re-populated for the next batch. At the expense of memory usage, the input, args and output arrays can be double-buffered so that one set is populated while the kernel runs on the other, overlapping CPU and GPU execution almost entirely.

Completion:
-----------

Once all of the closures have been evaluated (using whatever metric the renderer chooses), the memory can be deallocated. Alternatively, a different set of closures can be placed on the GPU and batch executed.

-----

The above approach should be pretty efficient on Fermi-class devices. The main computational overhead vs. immediate shading is the sorting step, but sorting is pretty fast nowadays on the GPU (and 4-5 times faster than the equivalent sort on Intel's fastest quad-core CPU).

Thoughts?

Best, Erich

Erich Ocean

Oct 3, 2010, 4:04:14 PM10/3/10
to OSL Developers
[1] http://back40computing.googlecode.com/svn/wiki/documents/RadixSortTR.pdf

To give an idea of the extra cost of GPU execution, issuing a "batch"
of one million closures to evaluate would take ~1.25ms total in the
sorting stage. (Sorting is the primary computational cost the GPU
incurs in the above approach over immediate CPU evaluation.)

With a batch size of one million, the number of warps that would be
executing two or more closure types at the same time is 0.09%, so
slowdowns due to that are basically non-existent. (Without the
sorting, almost every warp would diverge very badly since each thread
would be executing a different closure.) Since Fermi-class devices
have a normal, CPU-like coherent L2 cache, the efficiency of loading
per-closure instance data should have the best possible cache
behavior.

And because each warp in the uber-kernel is evaluating the same code,
we get fully efficient use of the execution units available. An nVidia
GTX 480 has 480 computational cores. So compared to a quad-core CPU,
we're able to execute 120 times more closures simultaneously. Even
with less than perfect speed-up, I expect closure evaluation on the
GPU to be MUCH faster than the equivalent closure evaluation done on
the CPU.

Implementation-wise, it's much simpler to implement a couple dozen
closures on the GPU than the whole LLVM shading engine we use in stage
one.

Best, Erich

Jared Hoberock

Oct 3, 2010, 4:17:51 PM10/3/10
to osl...@googlegroups.com
You might consider individual kernels for each closure type rather than going to the trouble of building a monolithic uberkernel.  I agree that a sorted dataset will avoid warp divergence penalties, but an uberkernel would suffer occupancy penalties individual kernels could avoid.  Since your input will be sorted, finding where each kernel should begin and end its work will be easy.  Moreover, because kernel launch time is negligible and Fermi GPUs can execute multiple kernels in parallel, there shouldn't be any problems keeping the machine busy.


--
You received this message because you are subscribed to the Google Groups "OSL Developers" group.
To post to this group, send email to osl...@googlegroups.com.
To unsubscribe from this group, send email to osl-dev+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/osl-dev?hl=en.


Erich Ocean

Oct 3, 2010, 7:49:11 PM10/3/10
to osl...@googlegroups.com
Great point! I keep forgetting Fermi can have multiple kernels in-flight.

Another benefit of separate kernels, beyond avoiding the occupancy penalty, is the ease of adding new closure types, plus a mild gain in efficiency from eliminating the cost of the switch statement in the uber-kernel and only issuing calls for closure types that will actually be run. The code footprint is lower, so icache misses should be lower too.

-E

Erich Ocean

Oct 4, 2010, 2:47:31 PM10/4/10
to OSL Developers
I have three questions for the group about the proposal:

1. Is execution of the closures themselves considered "in-scope" for
the OSL project?

2. Is batched closure execution something that would work with the
renderers that are considering adopting OSL?

3. What percentage of time on existing production frames is spent in
stage one vs. stage two?

The latter in particular seems critical. Any feedback/thoughts are
appreciated.

Christopher Kulla

Oct 6, 2010, 9:50:28 PM10/6/10
to osl...@googlegroups.com
On Mon, Oct 4, 2010 at 11:47 AM, Erich Ocean <erich...@me.com> wrote:
> I have three questions for the group about the proposal:
>
> 1. Is execution of the closures themselves considered "in-scope" for
> the OSL project?
>

It's a bit of a grey area, but currently the answer is "no". Closures
can define things as simple as an emissive surface (return a color), a
BRDF, or more complicated things like subsurface scattering. They only
hold the parameters required for a renderer to apply the necessary
math; they say nothing about _how_ the effect will be achieved.

So it should be clear that "closure evaluation" is highly subjective,
not only from renderer to renderer but also from closure to closure.
It will be very different depending on the actual algorithm you are
using for each term of the lighting equation. Even when just thinking
about BRDFs, different sampling/integration strategies will require
different things from your closures. I don't think it really makes
sense to talk about closure evaluation as a standalone step - it is
deeply connected with how you are generating the frame.


Just to give a concrete example: the simple shader Ci = diffuse(N);
will trigger ray tracing on every pixel in our renderer, but someone
else's renderer may trigger an irradiance cache lookup (which may
interpolate and not trace any rays at all, and therefore not require
sampling any directions).

> 2. Is batching closures execution something that would work with the
> renderers that are considering adopting OSL?

Again - this depends on how you are planning to do the sampling and
light integration. For us, the current implementation is a pretty
standard ray tracer that processes ray hits one at a time. Moreover,
not all closure calculations can be batched, or need to be. A closure
that just returns a color doesn't need much "batching" (although I
suppose this depends on the overall architecture of your system, the
batching could be there for other reasons).


>
> 3. What percentage of time on existing production frames is spent in
> stage one vs. stage two?

We are still collecting numbers on this, but I doubt we'll find a
constant ratio. This is going to vary significantly based on how stage
2 works (could be tracing rays, could be traversing a point cloud,
something more exotic ...). This could even vary based on how many
features of the renderer a particular shader is invoking (diffuse,
glossy, refraction, subsurface, etc ...). It's also going to vary based
on shading complexity vs. geometric complexity, which is something
that different productions may have a different balance of.

-Chris



Erich Ocean

Oct 6, 2010, 11:33:57 PM10/6/10
to osl...@googlegroups.com
>> 1. Is execution of the closures themselves considered "in-scope" for
>> the OSL project?
>
> Its a bit of a grey area, but currently the answer is "no".
> [snip]

Thanks for the detailed explanation. I'm still coming to grips with how OSL is meant to be used, and your response clarified things a lot. It sounds like Sony is using OSL in roughly the same way that PBRT handles its own materials.

I had been working on adapting minilight [1] to OSL coding standards, so that a basic testrender executable with something like a default material ball and lighting scene would be available as part of the OSL distro, and also serve as an example of integrating OSL with a renderer. But after reading your email, I can see that's not really the direction OSL is heading.

It looks like the only thing I'll have to contribute is Ptex support in the texture() call, and since Larry has already started working on adapting OIIO for Ptex, that is looking less and less necessary.

Which is all, of course, a good thing: it means OSL as a project is maturing. Better to spend time making pretty pictures with OSL than plunking around with the shading engine internals. :-)

Best, Erich

[1] http://www.hxa.name/minilight/

Doug Epps

Oct 6, 2010, 11:58:07 PM10/6/10
to osl...@googlegroups.com

Hi Folks,

Long time listener, first time caller here.

Erich, I think some sort of example renderer is critically important to have in the distro.

I've been corresponding with Larry and Solomon off list to try to get myself to a level of understanding sufficient to make just such an example renderer (testrace.cpp, if you will).

At least to this fellow, how to use the library in a 'real' setting is non-obvious without some hand-holding. Having something dead-simple that actually uses the closures will go a long way to getting new people up to speed. I think minilight is even overkill. I've been trying to build my understanding using std::vector<> as an acceleration structure and Imath for ray intersections. That way there's nothing to come up to speed on other than OSL itself.

So if you have something 'real' that shows how to use this sucker, I say post it!

-doug

PS - thanks for all the incredibly useful reading from this list !

Christopher Kulla

Oct 7, 2010, 2:51:22 AM10/7/10
to osl...@googlegroups.com
Absolutely. Just like we have testshade, we should have something like
"testrender" that is a very simple toy ray tracer that demonstrates
the basics of invoking a shader (with derivatives, etc ...) and making
use of the closures returned by OSL for lighting calculations. It
probably won't have displacement, volumes or subsurface scattering,
but at least light sources (with light attenuators) and BRDFs
(diffuse/glossy/singular). This should be good enough to get people
over the initial hump. It should also address some of the basic
concerns like how to safely use the library in a multi-threaded
environment.


The key will be balancing simplicity vs. features vs. performance.
This doesn't need to be a production quality system, but it should let
users write their first OSL shaders and get decent images back in a
reasonable amount of time.


With the batched style of shader execution this project was going to
be fairly ambitious. Now that the LLVM backend lets you run shaders
one point at a time, this task should be much easier.


-Chris

Viewon01

Oct 7, 2010, 6:25:51 AM10/7/10
to OSL Developers
Hi Erich,

Where can we download a MiniLight version integrated with OSL, if
available?

I'm very interested to take a look at this example, for sure it will
help me.

thanks

Simon Bunker

Oct 7, 2010, 8:48:15 AM10/7/10
to OSL Developers
It would definitely be great to have a reference renderer that you can
give some sort of scene description and get back an image. You mention
PBRT - which does seem to already have all the necessary components.
How hard would it be to write a PBRT material plugin to shade using
OSL? However anything would be great - it could always be expanded on
later.

Simon

Christopher Kulla

Oct 7, 2010, 1:26:59 PM10/7/10
to osl...@googlegroups.com
PBRT integration should be do-able, although it's not something that
could be packaged with the OSL distribution itself (their code is
GPL). So I still feel like something of roughly the scope of minilight
is a better first step.

-Chris

Larry Gritz

Oct 8, 2010, 11:52:11 AM10/8/10
to osl...@googlegroups.com
All along, I was thinking of something only minimally more complex than testshade -- for example, a "renderer" hard-coded to ray trace a single sphere sitting on a quad (no ray acceleration, tessellation, etc.) lit by a point light.

If somebody wants to integrate OSL into a more robust renderer, such as PBRT, by all means do so. But even if it were the right kind of license, I don't think we want the OSL library itself to turn into a robust renderer or absorb thousands of lines of external code just for the sake of having "shader ball" tests.

I do agree that we are truly desperate for some example code that shows a simple but working example of how to invoke shaders, get the closure results back, and integrate them to get the camera ray radiance values. Something along those lines would be a truly valuable contribution.

I haven't looked at the Minilight code, but from the description, it looks like it could fit the bill.

-- lg

--
Larry Gritz
l...@imageworks.com

