--
You received this message because you are subscribed to the Google Groups "OSL Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osl-dev+u...@googlegroups.com.
To post to this group, send email to osl...@googlegroups.com.
Visit this group at https://groups.google.com/group/osl-dev.
For more options, visit https://groups.google.com/d/optout.
There is an extra diffuse bounce, like in Arnold.
The light exposure in RenderMan is a power function, not linear like Maya lights.
The setup of the BXDF is doing double sided shading. Setting it to a single sided mode is faster.
We'll put together a better setup for RenderMan and send it your way (we are busy preparing our 21.0 release of RenderMan at the moment).
Also, with our adaptive sampling, we converge faster by intentionally not throwing as many rays; after all in the end it's about speed to convergence. In our case denoise can make RenderMan even faster; our production metric is about speed to denoised convergence -- which is crucial to getting frames up on a big screen. I'd love to see some comparisons to a given ground truth RMS error metric of all 3 with all the features brought to bear.
But this is an OSL mailing list, not a "my renderer is better than yours" list. I'll e-mail links to an improved RenderMan scene off-list.
Thanks, that test is cool.
> The setup of the BXDF is doing double sided shading. Setting it to a single sided mode is faster.
The Arnold one has the same issue - is there a way to get 3Delight to do double-sided shading?
--
Also, the trivial shader doesn't really seem to highlight the differences between OSL, RSL and C++ shading. Where OSL can really shine w.r.t. C++ shaders is when it has the chance to aggressively JIT-optimize shader instances, and where the shader graph mechanics (calls, caching, etc.) can be minimized. Honestly, I can't see why Arnold running OSL should be significantly faster than C++ in this case. Perhaps someone more familiar with Arnold in its various forms can say more...?
I'm not familiar with Arnold's shading so I don't know if it's closer to RSL or OSL, but there's more to OSL than the JIT. The forced use of closures allows some things to be done in the renderer which are impossible with RSL-style "you get the result right now" shading. The runtime optimizations and JIT are an added bonus, but you could do much the same with any interpreted language.

So even with a trivial shader, OSL can make a difference, and this is in part what the test shows. That the difference is not in the shader's running time does not make it irrelevant. The design of OSL is just as important as the efficiency of its implementation.
--
... It doesn't hurt performance too badly for us, but it would be beneficial if a renderer could work more closely with OSL's input data format rather than wasting cycles copying and reformatting data.
Hopefully this is what 3Delight is doing internally :).
...
> Also, the trivial shader doesn't really seem to highlight the differences between OSL, RSL and C++ shading. Where OSL can really shine w.r.t. C++ shaders is when it has the chance to aggressively JIT-optimize shader instances, and where the shader graph mechanics (calls, caching, etc.) can be minimized. Honestly, I can't see why Arnold running OSL should be significantly faster than C++ in this case. Perhaps someone more familiar with Arnold in its various forms can say more...?

Like everyone else... obviously I'm curious about the renderer head-to-head, but we'll really need to get Anders to throw Manuka into the ring and someone from Disney if we're going to have that showdown ;-)
--
Can you reveal anything about whether there is an API to extend the closure support (or other OSL facilities) for 3Delight?
On Jul 18, 2016, at 7:23 AM, Wayne Wooten <wlwo...@gmail.com> wrote:

Indeed, RIS tries to exploit coherence because otherwise, with modern hardware, you are leaving a lot of performance sitting idle. Our C++ patterns/BXDFs exploit this coherence, but sadly the current design of OSL makes this impossible for us. Even a different/alternate point index API would yield better performance than the current single-point shading design. Fortunately the data swizzle doesn't hurt that much with complex shader networks like the ones used on "Finding Dory" and "Piper".
--
Wayne

PS: The upcoming RenderMan 21 release acknowledges that OSL is the future, so much so that we've pulled RSL from RenderMan 21.
Forking the thread slightly with a new subtopic, since Wayne brought it up.

I'm aware of that HPG paper from Intel; it's certainly on my radar to revisit the batch shading issue. I'm quite keen to have input about what the API should be: what should ShaderGlobals, the calls to execute & get_symbol, and the most important RendererServices member functions look like to support shading of multiple points, in order to fit into your respective renderers cleanly? If anybody wants to mock it up -- not how the guts work, just how the declarations are and/or how the setup and call from the renderer side should look -- that would help kick things off. The last thing I want to do is put a lot of work into an implementation and have it turn out to be a total mismatch to how the renderers think of the problem.
I think, honestly, that ShaderGlobals is the key to my getting a mental handle on it. How do you imagine the memory layout of the basic data that the renderer provides, such as P, N, etc.? Does ShaderGlobalsBatch just replace the live data elements with a pointer for each field, which are assumed to be contiguous? Can we assume we start out with all n points running, or do we need "runflags" coming in from the renderer because somehow the rays will end up with points sharing a material being "non-contiguous"? Does the initial batch size need to be fully general, or is it sufficient/preferred to limit it to SIMD sizes (for example, only supporting multiples of 4, or only supporting SIMD sizes found in the wild: 1, 4, 8, 16), or to limit it to the actual number of SIMD lanes per batch (4 for SSE, 8 for AVX, 16 for AVX-512/KNC) in order to eliminate any looping inside the execution?
Feedback always appreciated.
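To help kick that mock-up off, here is one possible sketch (all names hypothetical, not an actual OSL API): a batched ShaderGlobals holding SoA pointers plus a run mask, and the kind of masked loop a simple op might compile to.

```cpp
#include <cstdint>

// Hypothetical sketch: each varying field becomes a pointer to a
// contiguous array of batch_size elements, plus a run mask so the
// renderer can express non-contiguous active points.
struct Vec3 { float x, y, z; };

struct ShaderGlobalsBatch {
    int          batch_size = 0;  // number of points in this batch
    const Vec3  *P = nullptr;     // positions, P[0..batch_size)
    const Vec3  *N = nullptr;     // shading normals
    const float *u = nullptr;     // surface parameters
    const float *v = nullptr;
    uint32_t     run_mask = 0;    // bit i set => lane i is active
};

// The kind of masked loop a simple op turns into: copy N to an output,
// touching only active lanes.
void copy_normals_masked(const ShaderGlobalsBatch &sgb, Vec3 *out) {
    for (int i = 0; i < sgb.batch_size; ++i)
        if (sgb.run_mask & (1u << i))
            out[i] = sgb.N[i];
}
```

A bitmask caps the batch at 32 (or 64) points, which may or may not be acceptable; explicit runflag arrays are the obvious alternative.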
--
| | time (fastest of 5) | total rays | total camera rays | total diffuse rays | total shadow rays | transparent rays | total points shaded | shading time |
| 3DL ramp | 41.82 | 4.649E+07 | 1.039E+07 | 8.911E+06 | 8.072E+06 | 1.912E+07 | | 4.6s |
| Arnold ramp | 45.196 | 2.725E+07 | 8.522E+06 | 6.930E+06 | 1.179E+07 | | 1.326E+08 | |
| 3DL tex | 93.16 | 1.175E+08 | 1.039E+07 | 8.837E+06 | 8.000E+06 | 9.029E+07 | | 10.7s |
| Arnold tex | 52.949 | 2.991E+07 | 8.522E+06 | 7.882E+06 | 1.350E+07 | | 9.915E+07 | |
I expect OSL's shader execution itself to outperform batched interpreters because of the JIT to machine code and not having any interpreter overhead or explicit loops over points for each op (especially for small batches). And for complicated shader group networks, I expect it to outperform even precompiled C/C++ significantly, because of the extensive runtime optimization that happens with full knowledge of the bound instance parameter values and connectivity of the network.
But these benchmarks, with trivial shading, are not really about that. So what's it showing, exactly? Renderer-to-renderer comparisons are notoriously difficult. Even in the two 3Delight tests, they don't end up with the same number of rays. Can you say a bit more about why? Something is happening that's a bit more complicated than just swapping out the shading engine.
Aghiles, what's your best guess about WHY the OSL path on this scene is so much faster? Is it a difference in setup time for each shade?
Interpreter overhead from dealing with multiple points? Does the use of closures allow a different approach to sampling? Something else?
Does the interpretation point to anything we can do on the OSL side to make it even more efficient for renderers?
But I guess my point is more that to implement a BSSRDF in a pure OSL context, you would need some robust raytrace() call support to perform this volume sampling (which then leads on to good QMC sample generators, for which there are currently no hooks in OSL apart from random()).

Have you thought about this scenario? Would that warrant a further C++ API?
> On Jul 18, 2016, at 1:02 PM, Olivier Paquet <olivier...@gmail.com> wrote:
>
> I'd say turn float into float* and pass __m128, __m256, etc. into them as appropriate. A matrix becomes a matrix of __m128. Even strings, texture handles, etc. should become arrays. Making exceptions and saying "but this will always be uniform" is asking for trouble down the road.
So you think the ShaderGlobals (or equivalent) should directly hold the arrays? Or should it hold pointers?
> The only point I can see to something more complex is if you intend to transfer the data over to the GPU to process much larger batches.
Well, that is a legit consideration. It's a different code path in many ways, but there may be merit to allowing the renderer side to have a uniform API regardless of whether the back end is SSE/AVX or has a GPU behind it. If we think that is a virtue, then it might make us lean more towards pointers in the struct that we pass.
On the other hand, having the memory live in the SG itself might help keep fragmentation to a minimum, help keep things in cache?
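To make the two possibilities concrete, a rough sketch of both layouts (hypothetical names, not a real proposal):

```cpp
#include <cstddef>

// Layout 1 holds pointers, so the same struct can describe arrays that
// live anywhere (renderer-owned CPU memory, or buffers destined for a
// GPU back end). Layout 2 embeds fixed-size arrays, keeping the hot
// data contiguous in one allocation and cache-friendly, at the cost of
// baking the batch size into the type.
constexpr int kBatch = 8;

struct SGBPointers {                 // indirection: flexible backing store
    const float *Px, *Py, *Pz;       // renderer owns and fills the arrays
};

struct SGBInline {                   // inline: contiguous, no pointer chasing
    float Px[kBatch], Py[kBatch], Pz[kBatch];
};
```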
> Just a quick example: how do you handle lazy evaluation of layers if a value is used in two places, with different run flags? Something like:
>
> if( u < 0.25 )
> {
> use some connected parameter which requires evaluation of layer A
> }
> if( u > 0.75 )
> {
> use some connected parameter which requires evaluation of layer A
> }
Did you mean to say 'A' both times? That is no problem, A still runs the first time it's needed.
Now, I've oversimplified. The fact is that a lot of simple operations now will need to turn into masked writes and whatnot. But it does seem likely that we can still come out ahead.
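One way to picture that bookkeeping (a sketch with made-up names): keep a per-layer mask of lanes that have already run, and execute only the lanes still missing.

```cpp
#include <cstdint>

// Sketch of lazy layer evaluation under run masks: a layer executes the
// first time any lane needs it, and a per-layer "done" mask remembers
// which lanes already have results. A later request with a different
// run mask executes only the still-missing lanes.
struct LayerState {
    uint32_t done_mask = 0;   // lanes for which the layer has run
    int      run_count = 0;   // how many times we actually executed
};

void request_layer(LayerState &layer, uint32_t need_mask) {
    uint32_t missing = need_mask & ~layer.done_mask;
    if (missing) {
        // ... run the layer's ops with run mask = missing (masked writes) ...
        layer.done_mask |= missing;
        ++layer.run_count;
    }
}
```

So in the u < 0.25 / u > 0.75 example, layer A runs twice, once per disjoint lane set, and never re-runs for lanes that already have results.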
But I also wonder if supporting something like 3) would cause a much more massive re-factoring of the OSL LLVM backend than 2)?
--
Fully separating the x, y, z components of each 3-vector (and, ugh, even the Dual2<Vec>'s we use for derivatives) is significantly more trouble than turning int -> simd int, float -> simd float, but keeping 3-vectors together as a unit (padded so each 3-vector is a __m128). The latter scheme won't scale as well with 8-wide or 16-wide SIMD, but it sure is easier to deal with, and it would help speed up even the "scalar" mode like we do now.
Even in the absence of trying to shade multiple points at once, I've been contemplating changing the current system by padding the 3-vectors to be 4-vectors across the board, for the sake of being able to use SIMD ops internally for all the vector-vector math.
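A sketch of what that padding could look like (assuming SSE; the names here are made up):

```cpp
#include <immintrin.h>

// Sketch of the padded-3-vector idea: store x, y, z plus one pad float
// so each vector is 16 bytes and 16-byte aligned, letting vector math
// use a single SSE load/store instead of tricks to touch only 3 floats.
struct alignas(16) Vec3Padded {
    float x, y, z, pad;
};

inline Vec3Padded vadd(const Vec3Padded &a, const Vec3Padded &b) {
    Vec3Padded r;
    // One aligned load per operand, one SIMD add, one aligned store.
    _mm_store_ps(&r.x, _mm_add_ps(_mm_load_ps(&a.x), _mm_load_ps(&b.x)));
    return r;
}
```

The pad lane gets added too, but since nothing reads it, that is harmless; the cost is the 25% extra memory traffic mentioned below.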
Excuse my lack of familiarity with how RIS is structured. When you say "PRMan RIS style", do you mean that the 3-vectors are kept contiguous, or fully separated?
I remember being underwhelmed, though, when I first tried swapping in __m128 for Vec3 math and learned how the 25% extra memory and the speed of the memory bus make a significant dent in the potential speed gains.
One other thing to be wary of (as told to me by someone much smarter, so I may be getting it wrong) is a slight downgrade in precision: I believe traditional 32-bit single-float math is still performed in the 80-bit x87 registers and then truncated, whereas SSE float lanes are literally 32 bits each and perform the math at that precision, which can lead to more accumulated error. Not sure if this behaviour has changed.
Prman RIS has shader globals structured in Vec3 array batches like this:

Vec3 P[batchSize];
Vec3 dPDu[batchSize];
// etc.
But those days are gone. Even what LOOKS like scalar ops these days are actually using the same ALU with the same old SIMD registers, just that only one lane is used. (That's my understanding, anyway.)
The only change I would make is to promote Vec3 to actually be laid out as a Vec4 underneath, so we can directly load and save them as __m128's without any of the funny tricks to load or store only 3 values.
There is the classic "load a 3D vector into a SIMD register" approach, which has diminishing but still valuable returns, for two reasons: 1) it's a bit faster than scalar code, so why not? 2) there is always the case where the batch size approaches 1, and this style of execution is preferable to batched vector code where only 1 or 2 lanes are active.
Then there is the execution of a batch that is properly vectorized. I am not sure about the following, but assuming that having the ShaderGlobals structure composed of a bunch of pointers does not introduce unnecessary dereferencing overhead, there is still the problem of data alignment:
Option 1:
We have a pointer to P.x, one to P.y, one to P.z, etc. This seems versatile: the user may define at JIT time whether the data will be aligned or not, so the JIT will insert aligned or unaligned loads accordingly. The user should guarantee that safety padding is available at the end of the buffers to avoid segfaults at page boundaries.
An argument to the API call that runs the shader may specify a batch size or offset (in accordance with the alignment requirement configured at JIT time).
Let's see the positives:
- the interface is independent of the ISA.
- the user can decide to pack memory contiguously and limit fragmented access.
- the user may use an arbitrary batch size (the implementation would not loop over individual instructions, but rather over the whole shader execution at the ISA vector width).
Negatives:
- lots of code to fill in the struct with all those pointers.
- if the memory is not aligned properly, it'll crash hard.
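A sketch of the buffer contract Option 1 implies (hypothetical helper, not proposed API): each component array is over-allocated to a whole number of SIMD-width chunks and aligned, so a full-width load at the last real point cannot run off the end into an unmapped page.

```cpp
#include <cstdint>
#include <cstdlib>

constexpr int kSimdWidth = 8;   // e.g. AVX: 8 floats per load

// Allocate one component array (P.x, P.y, ...) with alignment and
// tail padding, so aligned full-width loads are always safe.
float *alloc_component(int npoints) {
    // Round up to a whole number of SIMD-width chunks.
    int padded = ((npoints + kSimdWidth - 1) / kSimdWidth) * kSimdWidth;
    // padded * sizeof(float) is a multiple of 32, as aligned_alloc requires.
    return static_cast<float *>(std::aligned_alloc(32, padded * sizeof(float)));
}
```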
Option 2:
The API provides different structs and calls appropriate to each ISA. The JIT will be configured to run against a fixed ISA, or with multiple-ISA dispatch where each API call variant is matched. In other words, we would have one struct variant with SSE types, another with AVX types, etc. It would be the responsibility of the user to call the shader execution multiple times to flush the batch.
Pro:
- may look simpler to fill in the ShaderGlobals struct; however, the user would potentially have to do that batchSize/simdWidth times.
- perhaps marginally faster due to implicit memory coherence (to be proven).
Cons:
- the API is explicit and more complex. Fewer optimizations can be done internally (e.g. executing with a runtime-detected ISA).
The findSymbol API should be changed in both cases. I believe we should not find symbols at all: we should know at JIT time which symbols are requested, and in which order they will be serialized to a buffer provided by the user in the execute-shader call. The spec should be rather simple: if the serialization order and types are known, it is simple to JIT the instructions that write the results. ISA-related padding would be implied.
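To illustrate the "no symbol lookup at shade time" idea (a sketch; names and types are hypothetical): the renderer declares the output symbols it wants at JIT time, in order, and execution then writes each result at a known float offset in one user-provided buffer.

```cpp
#include <vector>

// One requested output symbol, declared before JIT.
struct OutputSpec {
    const char *name;     // e.g. "Ci"
    int         nfloats;  // floats per shaded point for this symbol
};

// Compute each symbol's float offset within one point's output record.
// The JIT would bake these offsets into the generated store instructions,
// so there is no name lookup at shade time.
std::vector<int> layout_offsets(const std::vector<OutputSpec> &outs) {
    std::vector<int> offsets;
    int off = 0;
    for (const auto &o : outs) {
        offsets.push_back(off);
        off += o.nfloats;
    }
    return offsets;
}
```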
I have probably forgotten some details, but I'm happy to workshop these ideas, experiment on them, and contribute.
Max