--
You received this message because you are subscribed to the Google Groups "OSL Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osl-dev+u...@googlegroups.com.
To post to this group, send email to osl...@googlegroups.com.
Visit this group at https://groups.google.com/group/osl-dev.
For more options, visit https://groups.google.com/d/optout.
There is an extra diffuse bounce, like in Arnold.
The light exposure in RenderMan is a power function, not linear like Maya lights.
The setup of the BXDF is doing double sided shading. Setting it to a single sided mode is faster.
We'll put together a better setup for RenderMan and send it your way (we are busy preparing our 21.0 release of RenderMan at the moment).
Also, with our adaptive sampling, we converge faster by intentionally not throwing as many rays; after all in the end it's about speed to convergence. In our case denoise can make RenderMan even faster; our production metric is about speed to denoised convergence -- which is crucial to getting frames up on a big screen. I'd love to see some comparisons to a given ground truth RMS error metric of all 3 with all the features brought to bear.
But this is an OSL mailing list, not a "my renderer is better than yours" list. I'll e-mail links to an improved RenderMan scene off-list.
Thanks, that test is cool.
> The setup of the BXDF is doing double sided shading. Setting it to a single sided mode is faster.
The Arnold one has the same issue - is there a way to get 3Delight to do double-sided shading?
--
Also, the trivial shader doesn't really seem to highlight the differences between OSL, RSL and C++ shading. Where OSL can really shine w.r.t. C++ shaders is when it has the chance to aggressively JIT-optimize shader instances, and where the shader graph mechanics (calls, caching, etc.) can be minimized. Honestly, I can't see why Arnold running OSL should be significantly faster than C++ in this case. Perhaps someone more familiar with Arnold in its various forms can say more...?
I'm not familiar with Arnold's shading so I don't know if it's closer to RSL or OSL, but there's more to OSL than the JIT. The forced use of closures allows some things to be done in the renderer which are impossible with RSL-style "you get the result right now" shading. The runtime optimizations and JIT are an added bonus, but you could do much the same with any interpreted language.

So even with a trivial shader, OSL can make a difference, and this is in part what the test shows. That the difference is not in the shader's running time does not make it irrelevant. The design of OSL is just as important as the efficiency of its implementation.
--
... It doesn't hurt performance too badly for us, but it would be beneficial if a renderer could work more closely with OSL's input data format rather than wasting cycles copying and reformatting data.
Hopefully this is what 3Delight is doing internally :).
...
> Also, the trivial shader doesn't really seem to highlight the differences between OSL, RSL and C++ shading. Where OSL can really shine w.r.t. C++ shaders is when it has the chance to aggressively JIT-optimize shader instances, and where the shader graph mechanics (calls, caching, etc.) can be minimized. Honestly, I can't see why Arnold running OSL should be significantly faster than C++ in this case. Perhaps someone more familiar with Arnold in its various forms can say more...?

Like everyone else... obviously I'm curious about the renderer head-to-head, but we'll really need to get Anders to throw Manuka into the ring and someone from Disney if we're going to have that showdown ;-)
--
Can you reveal anything about whether there is an API to extend the closure support (or other OSL facilities) for 3Delight?
On Jul 18, 2016, at 7:23 AM, Wayne Wooten <wlwo...@gmail.com> wrote:

Indeed, RIS tries to exploit coherence because otherwise, with modern hardware, you are leaving a lot of performance sitting idle. Our C++ patterns/BXDFs exploit this coherence, but sadly the current design of OSL makes this impossible for us. Even a different/alternate point index API would yield better performance than the current single-point shading design. Fortunately the data swizzle doesn't hurt that much with complex shader networks like the ones used on "Finding Dory" and "Piper".
--
Wayne

PS: The upcoming RenderMan 21 release acknowledges that OSL is the future, so much so that we've pulled RSL from RenderMan 21.
Forking the thread slightly with a new subtopic, since Wayne brought it up.

I'm aware of that HPG paper from Intel; it's certainly on my radar to revisit the batch shading issue. I'm quite keen to have input about what the API should be: what should ShaderGlobals, the calls to execute & get_symbol, and the most important RendererServices member functions look like to support shading of multiple points, in order to fit into your respective renderers cleanly? If anybody wants to mock it up -- not how the guts work, just how the declarations are and/or how the setup and call from the renderer side should look -- that would help kick things off. The last thing I want to do is put a lot of work into an implementation and have it turn out to be a total mismatch to how the renderers think of the problem.
I think, honestly, that ShaderGlobals is the key to my getting a mental handle on it. How do you imagine the memory layout of the basic data that the renderer provides, such as P, N, etc.? Does ShaderGlobalsBatch just replace the live data elements with a pointer for each field, which are assumed to be contiguous? Can we assume we start out with all n points running, or do we need "runflags" coming in from the renderer because somehow the rays will end up with points sharing a material being "non-contiguous"? Does the initial batch size need to be fully general, or is it sufficient/preferred to limit it to SIMD sizes (for example, only supporting multiples of 4, or only supporting SIMD sizes found in the wild: 1, 4, 8, 16), or to limit it to the actual number of SIMD lanes per batch (4 for SSE, 8 for AVX, 16 for AVX-512/KNC) in order to eliminate any looping inside the execution?
Feedback always appreciated.
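To help kick that mock-up off, here is one possible sketch (all names hypothetical, not an actual OSL API): a batched ShaderGlobals holding SoA pointers plus a run mask, and the kind of masked loop a simple op might compile to.

```cpp
#include <cstdint>

// Hypothetical sketch: each varying field becomes a pointer to a
// contiguous array of batch_size elements, plus a run mask so the
// renderer can express non-contiguous active points.
struct Vec3 { float x, y, z; };

struct ShaderGlobalsBatch {
    int          batch_size = 0;  // number of points in this batch
    const Vec3  *P = nullptr;     // positions, P[0..batch_size)
    const Vec3  *N = nullptr;     // shading normals
    const float *u = nullptr;     // surface parameters
    const float *v = nullptr;
    uint32_t     run_mask = 0;    // bit i set => lane i is active
};

// The kind of masked loop a simple op turns into: copy N to an output,
// touching only active lanes.
void copy_normals_masked(const ShaderGlobalsBatch &sgb, Vec3 *out) {
    for (int i = 0; i < sgb.batch_size; ++i)
        if (sgb.run_mask & (1u << i))
            out[i] = sgb.N[i];
}
```

A bitmask caps the batch at 32 (or 64) points, which may or may not be acceptable; explicit runflag arrays are the obvious alternative.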
--
| | time (fastest of 5) | total rays | total camera rays | total diffuse rays | total shadow rays | transparent rays | total points shaded | shading time |
| 3DL ramp | 41.82 | 4.649E+07 | 1.039E+07 | 8.911E+06 | 8.072E+06 | 1.912E+07 | | 4.6s |
| Arnold ramp | 45.196 | 2.725E+07 | 8.522E+06 | 6.930E+06 | 1.179E+07 | | 1.326E+08 | |
| 3DL tex | 93.16 | 1.175E+08 | 1.039E+07 | 8.837E+06 | 8.000E+06 | 9.029E+07 | | 10.7s |
| Arnold tex | 52.949 | 2.991E+07 | 8.522E+06 | 7.882E+06 | 1.350E+07 | | 9.915E+07 | |
I expect OSL's shader execution itself to outperform batched interpreters because of the JIT to machine code and not having any interpreter overhead or explicit loops over points for each op (especially for small batches). And for complicated shader group networks, I expect it to outperform even precompiled C/C++ significantly, because of the extensive runtime optimization that happens with full knowledge of the bound instance parameter values and connectivity of the network.
But these benchmarks, with trivial shading, are not really about that. So what's it showing, exactly? Renderer-to-renderer comparisons are notoriously difficult. Even in the two 3Delight tests, they don't end up with the same number of rays. Can you say a bit more about why? Something is happening that's a bit more complicated than just swapping out the shading engine.
Aghiles, what's your best guess about WHY the OSL path on this scene is so much faster? Is it a difference in setup time for each shade?
Interpreter overhead from dealing with multiple points? Does the use of closures allow a different approach to sampling? Something else?
Does the interpretation point to anything we can do on the OSL side to make it even more efficient for renderers?
But I guess my point is more that to implement a BSSRDF in a pure OSL context, you would need some robust raytrace() call support to perform this volume sampling (which then leads on to good QMC sample generators, for which there are currently no hooks in OSL apart from random()).

Have you thought about this scenario? Would that warrant a further C++ API?
> On Jul 18, 2016, at 1:02 PM, Olivier Paquet <olivier...@gmail.com> wrote:
>
> I'd say turn float into float* and pass __m128, __m256, etc. into them as appropriate. A matrix becomes a matrix of __m128. Even strings, texture handles, etc. should become arrays. Making exceptions and saying "but this will always be uniform" is asking for trouble down the road.
So you think the ShaderGlobals (or equivalent) should directly hold the arrays? Or should it hold pointers?
> The only point I can see to something more complex is if you intend to transfer the data over to the GPU to process much larger batches.
Well, that is a legit consideration. It's a different code path in many ways, but there may be merit to allowing the renderer side to have a uniform API regardless of whether the back end is SSE/AVX or has a GPU behind it. If we think that is a virtue, then it might make us lean more towards pointers in the struct that we pass.
On the other hand, having the memory live in the SG itself might help keep fragmentation to a minimum, help keep things in cache?
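To make the two possibilities concrete, a rough sketch of both layouts (hypothetical names, not a real proposal):

```cpp
#include <cstddef>

// Layout 1 holds pointers, so the same struct can describe arrays that
// live anywhere (renderer-owned CPU memory, or buffers destined for a
// GPU back end). Layout 2 embeds fixed-size arrays, keeping the hot
// data contiguous in one allocation and cache-friendly, at the cost of
// baking the batch size into the type.
constexpr int kBatch = 8;

struct SGBPointers {                 // indirection: flexible backing store
    const float *Px, *Py, *Pz;       // renderer owns and fills the arrays
};

struct SGBInline {                   // inline: contiguous, no pointer chasing
    float Px[kBatch], Py[kBatch], Pz[kBatch];
};
```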
> Just a quick example: how do you handle lazy evaluation of layers if a value is used in two places, with different run flags? Something like:
>
> if( u < 0.25 )
> {
> use some connected parameter which requires evaluation of layer A
> }
> if( u > 0.75 )
> {
> use some connected parameter which requires evaluation of layer A
> }
Did you mean to say 'A' both times? That is no problem, A still runs the first time it's needed.
Now, I've oversimplified. The fact is that a lot of simple operations now will need to turn into masked writes and whatnot. But it does seem likely that we can still come out ahead.
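One way to picture that bookkeeping (a sketch with made-up names): keep a per-layer mask of lanes that have already run, and execute only the lanes still missing.

```cpp
#include <cstdint>

// Sketch of lazy layer evaluation under run masks: a layer executes the
// first time any lane needs it, and a per-layer "done" mask remembers
// which lanes already have results. A later request with a different
// run mask executes only the still-missing lanes.
struct LayerState {
    uint32_t done_mask = 0;   // lanes for which the layer has run
    int      run_count = 0;   // how many times we actually executed
};

void request_layer(LayerState &layer, uint32_t need_mask) {
    uint32_t missing = need_mask & ~layer.done_mask;
    if (missing) {
        // ... run the layer's ops with run mask = missing (masked writes) ...
        layer.done_mask |= missing;
        ++layer.run_count;
    }
}
```

So in the u < 0.25 / u > 0.75 example, layer A runs twice, once per disjoint lane set, and never re-runs for lanes that already have results.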
But I also wonder if supporting something like 3) would cause a much more massive re-factoring of the OSL LLVM backend than 2)?
--
Fully separating the x, y, z components of each 3-vector (and, ugh, even the Dual2<Vec>'s we use for derivatives) is significantly more trouble than turning int -> simd int, float -> simd float, but keeping 3-vectors together as a unit (padded so each 3-vector is a __m128). The latter scheme won't scale as well with 8-wide or 16-wide SIMD, but it sure is easier to deal with, and it would help speed up even the "scalar" mode like we do now.
Even in the absence of trying to shade multiple points at once, I've been contemplating changing the current system by padding the 3-vectors to be 4-vectors across the board, for the sake of being able to use SIMD ops internally for all the vector-vector math.
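A sketch of what that padding could look like (assuming SSE; the names here are made up):

```cpp
#include <immintrin.h>

// Sketch of the padded-3-vector idea: store x, y, z plus one pad float
// so each vector is 16 bytes and 16-byte aligned, letting vector math
// use a single SSE load/store instead of tricks to touch only 3 floats.
struct alignas(16) Vec3Padded {
    float x, y, z, pad;
};

inline Vec3Padded vadd(const Vec3Padded &a, const Vec3Padded &b) {
    Vec3Padded r;
    // One aligned load per operand, one SIMD add, one aligned store.
    _mm_store_ps(&r.x, _mm_add_ps(_mm_load_ps(&a.x), _mm_load_ps(&b.x)));
    return r;
}
```

The pad lane gets added too, but since nothing reads it, that is harmless; the cost is the 25% extra memory traffic mentioned below.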
Excuse my lack of familiarity with how RIS is structured. When you say "PRMan RIS style", do you mean that the 3-vectors are kept contiguous, or fully separated?
I remember being underwhelmed, though, when I first tried swapping in __m128 for Vec3 math and learned how the 25% extra memory and the speed of the memory bus make a significant dent in the potential speed gains.
One other thing to be wary of (as told to me by someone much smarter, so I may be getting it wrong) is a slight downgrade in precision: I believe traditional 32-bit single-float math is still performed in the 80-bit x87 registers and then truncated, whereas SSE float lanes are literally 32 bits each and perform the math at that precision, which can lead to more accumulated error. Not sure if this behaviour has changed.
Prman RIS has shader globals structured in Vec3 array batches like this:

Vec3 P[batchSize];
Vec3 dPDu[batchSize];
// etc.
But those days are gone. Even what LOOKS like scalar ops these days are actually using the same ALU with the same old SIMD registers, just that only one lane is used. (That's my understanding, anyway.)
The only change I would make is to promote Vec3 to actually be laid out as a Vec4 underneath, so we can directly load and save them as __m128's without any of the funny tricks to load or store only 3 values.
There is the classic "load a 3D vector into a SIMD register" approach, which has diminishing but still valuable returns, for two reasons: 1) it's a bit faster than scalar code, so why not? 2) there is always the case where the batch size approaches 1, and this style of execution is preferable to batched vector code where only 1 or 2 lanes are active.
Then there is the execution of a batch that is properly vectorized. I am not sure about the following, but assuming that having the ShaderGlobals structure composed of a bunch of pointers does not introduce unnecessary dereferencing overhead, there is still the problem of data alignment:
Option 1:
We have a pointer to P.x, one to P.y, one to P.z, etc. This seems versatile: the user may define at JIT time whether the data will be aligned or not, so the JIT will insert aligned or unaligned loads accordingly. The user should guarantee that safety padding is available at the end of the buffers to avoid segfaults at page boundaries.
An argument to the API call that runs the shader may specify a batch size or offset (in accordance with the alignment requirement configured at JIT time).
Let's see the positives:
- the interface is independent of the ISA.
- the user can decide to pack memory contiguously and limit fragmented access.
- the user may use an arbitrary batch size (the implementation would not loop over individual instructions, but rather over the whole shader execution at the ISA vector width).
Negatives:
- lots of code to fill in the struct with all those pointers.
- if the memory is not aligned properly, it'll crash hard.
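A sketch of the buffer contract Option 1 implies (hypothetical helper, not proposed API): each component array is over-allocated to a whole number of SIMD-width chunks and aligned, so a full-width load at the last real point cannot run off the end into an unmapped page.

```cpp
#include <cstdint>
#include <cstdlib>

constexpr int kSimdWidth = 8;   // e.g. AVX: 8 floats per load

// Allocate one component array (P.x, P.y, ...) with alignment and
// tail padding, so aligned full-width loads are always safe.
float *alloc_component(int npoints) {
    // Round up to a whole number of SIMD-width chunks.
    int padded = ((npoints + kSimdWidth - 1) / kSimdWidth) * kSimdWidth;
    // padded * sizeof(float) is a multiple of 32, as aligned_alloc requires.
    return static_cast<float *>(std::aligned_alloc(32, padded * sizeof(float)));
}
```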
Option 2:
The API provides different structs and calls appropriate to each ISA. The JIT will be configured to run against a fixed ISA, or with multiple-ISA dispatch where each API call variant is matched. In other words, we would have one struct variant with SSE types, another with AVX types, etc. It would be the responsibility of the user to call the shader execution multiple times to flush the batch.
Pro:
- may look simpler to fill in the ShaderGlobals struct; however, the user would potentially have to do that batchSize/simdWidth times.
- perhaps marginally faster due to implicit memory coherence (to be proven).
Cons:
- the API is explicit and more complex. Fewer optimizations can be done internally (e.g. executing with a runtime-detected ISA).
The findSymbol API should be changed in both cases. I believe we should not find symbols at all: we should know at JIT time which symbols are requested, and in which order they will be serialized to a buffer provided by the user in the execute-shader call. The spec should be rather simple: if the serialization order and types are known, it is simple to JIT the instructions that write the results. ISA-related padding would be implied.
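To illustrate the "no symbol lookup at shade time" idea (a sketch; names and types are hypothetical): the renderer declares the output symbols it wants at JIT time, in order, and execution then writes each result at a known float offset in one user-provided buffer.

```cpp
#include <vector>

// One requested output symbol, declared before JIT.
struct OutputSpec {
    const char *name;     // e.g. "Ci"
    int         nfloats;  // floats per shaded point for this symbol
};

// Compute each symbol's float offset within one point's output record.
// The JIT would bake these offsets into the generated store instructions,
// so there is no name lookup at shade time.
std::vector<int> layout_offsets(const std::vector<OutputSpec> &outs) {
    std::vector<int> offsets;
    int off = 0;
    for (const auto &o : outs) {
        offsets.push_back(off);
        off += o.nfloats;
    }
    return offsets;
}
```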
I have probably forgotten some details, but I'm happy to workshop these ideas, experiment on them, and contribute.
Max