[angleproject] Solving slow compilation of long loops with texture sampling

Alvaro Segura

Mar 30, 2011, 7:09:14 AM

to anglep...@googlegroups.com

Hi All,

This post complements our previous post "[angleproject] Solving slow compilation (and eventual fail) in complex shaders (with patch)" with a related but different problem. Previously we discussed a solution for slow compilation of long loops by preventing their unrolling. It has been said that Chrome 12 will compile shaders without unrolling to improve this. We have tested Chrome 12 and it certainly improves fractal.io as much as our solution with "[loop] [fastopt] for (...)".

Now, there are other cases, such as that reported by John Davis (http://www.pcprogramming.com/flight.html), and others, where loops need to do texture sampling (i.e. texture2D(tex, uv); ).

The problem here seems to be the following or a similar issue: texture2D() translates to tex2D() in HLSL. tex2D is said to be a "gradient instruction" because it uses mipmapping (even if we defined MIN_FILTER LINEAR!). HLSL does not allow gradient instructions in true loops (at least in loops with "break", which I can't find in that demo (?)), so upon seeing that call, DX forces an unrolling, even if Chrome 12 is trying to avoid that. The result is a very long compile time and a possible error if the loop is too deep and can't be unrolled.

In HLSL, mipmapping can be avoided by using tex2Dlod(tex, float4(uv, 0, 0)) [the last component being the mip level chosen]. Great. Is there a similar function in GLSL? Yes: texture2DLod(). Can the shader developer just change it? No: GLSL only allows texture2DLod() in vertex shaders, not fragment shaders.

A solution implies Angle emitting an HLSL "tex2Dlod(...,0,0)" instruction from a source GLSL "texture2D(...)". We tested this, again in a custom-built Angle library, and it worked great for our application, which does heavy texture sampling inside long loops.

Can that translation be done always, as in our naive fix? Not really. There are plenty of applications that need correct mipmapped sampling for good minification of textures. The translation then needs to be done selectively, only where necessary. The rule could be to check whether the texture2D call is inside a loop, or inside a loop of more than N iterations (the first seems easier to do):

texture2D(a,b); out of loop => tex2D(a,b);
texture2D(a,b); inside loop => tex2Dlod(a, b, 0, 0);
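
A minimal sketch of the "inside a loop" rule, assuming the translator keeps track of how many loops enclose the node it is currently emitting. `SamplingContext` and `translateTexture2D` are hypothetical names for illustration only, not part of ANGLE:

```cpp
#include <string>

// Hypothetical translator state: how many loops enclose the current node.
struct SamplingContext {
    int loopDepth = 0;
};

// Returns the HLSL expression for a GLSL texture2D(sampler, coord) call.
// Outside loops it keeps tex2D (mipmapped); inside a loop it forces mip
// level 0 through tex2Dlod so no gradient instruction ends up in the loop.
std::string translateTexture2D(const SamplingContext &ctx,
                               const std::string &sampler,
                               const std::string &coord)
{
    if (ctx.loopDepth > 0) {
        // tex2Dlod takes a float4; the w component selects the mip level.
        return "tex2Dlod(" + sampler + ", float4(" + coord + ", 0, 0))";
    }
    return "tex2D(" + sampler + ", " + coord + ")";
}
```

A real implementation would increment `loopDepth` when entering a loop node and decrement it on exit; a threshold on the iteration count could be added at the same place.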

We propose this change or equivalent fixes as we definitely need sampling in loops, following the necessary testing so it does not break anything. Pure OpenGL works with no problems, BTW.

For reference, we are using Angle SVN trunk (rev 598), compiled under VS2008, and then replacing the resulting DLL in FF4. I have upgraded the MS Platform SDK and the DX SDK to the latest versions, on a Win7 64-bit PC with an nVidia GTX485. BTW, what is different in Chrome 12? Doesn't it use regular SVN-trunk Angle?

Best regards

Daniel Koch

Mar 30, 2011, 4:43:45 PM

to ase...@vicomtech.org, anglep...@googlegroups.com

On 2011-03-30, at 7:09 AM, Alvaro Segura wrote:

> texture2D(a,b); out of loop => tex2D(a,b);
> texture2D(a,b); inside loop => tex2Dlod(a, b, 0, 0);

That doesn't really seem like it would be a safe universal change though...

> We propose this change or equivalent fixes as we definitely need sampling in loops, following the necessary testing so it does not break anything. Pure OpenGL works with no problems, BTW.
>
> BTW, what is different in Chrome 12? Doesn't it use regular SVN-trunk Angle?

Chrome 12 does use regular SVN-trunk ANGLE (as far as I know), however it uses a different build system (GYP) that has some different defines. See http://code.google.com/p/angleproject/source/browse/trunk/src/build_angle.gyp for the relevant settings.

Hope this helps,

Daniel

---

Daniel Koch -+- dan...@transgaming.com

Senior Graphics Architect -+- TransGaming Inc. -+- www.transgaming.com

Alvaro Segura

Mar 30, 2011, 6:21:11 PM

to angleproject

You are right. That's why we propose to find a way to apply a smart
selective change.

Let me state the problem in a different way:

We are not talking about saving several seconds as in the flight.html
demo. In fact, we faced this problem more severely. Our application,
with moderate iteration settings takes more than 1 minute trying to
compile (with a few "your script is taking too long" alerts) and
finally fails. I don't have the numbers here, I think we have to hit
"stop script" after minutes.

Our fix makes compilation possible, and in only a couple seconds, the
same as native GLSL. Rendering itself is acceptably fast after that
for such a shader.

So, yes, a rule has to be chosen to apply the tex2Dlod() trick only
when necessary that does not break the rest of cases. Some
suggestions:

- texture reads inside loops
- texture reads inside loops longer than N iterations (N=10?)
- texture reads inside loops when the texture has a MIN_FILTER=LINEAR
or NEAREST, ...

That last idea might be the safest. Is it possible in Angle to know
the texture filtering mode? (texParameteri TEXTURE_MIN_FILTER) If the
mode for the relevant texture is LINEAR or NEAREST (not _MIPMAP_) then
using tex2Dlod is just fine, right?

Thanks for your attention.

Best regards,

Alvaro

> Chrome 12 does use regular SVN-trunk ANGLE (as far as I know), however it uses a different build system (GYP) that has some different defines. See http://code.google.com/p/angleproject/source/browse/trunk/src/build_a... for the relevant settings.

Nicolas Capens

Mar 31, 2011, 5:59:22 AM

to ase...@vicomtech.org, angleproject

Hi Alvaro,

The GLSL ES specification does not guarantee all loops with dynamic indexing
to compile/link successfully. See sections 10.25, 10.35, and Appendix A item
5. So strictly speaking ANGLE is conformant and even certain code which does
compile successfully with ANGLE may not work as expected on other platforms.

That said, the reason HLSL does not allow functions which compute
screen-space gradients inside dynamic loops is because these gradients would
be unpredictable when neighboring pixels take different branches. Unrolled
loops don't suffer from this because even if logically different branches
are taken, all paths are still guaranteed to be executed so there are
meaningful gradients.

The GLSL ES spec doesn't say how gradients should be computed when an
implementation does support dynamic loops, but the desktop GLSL spec
(section 8.10.1) states that "derivatives within nonuniform
control flow are undefined". This is a significant caveat because even if
your shader does compile successfully, you're not guaranteed to get the same
result on different platforms. Therefore it's questionable how useful this
extended ability is in the first place. Note that HLSL really does a best
effort at creating repeatable results, but you simply might run out of
instruction slots supported by the hardware.

In theory non-mipmapped texture lookups could be supported by implementing
them using tex2Dlod, but this would significantly complicate things because
there can be many different versions of the same shader. In my humble
opinion the effort does not outweigh the advantage. As far as I can tell
only convoluted tech demos might use this functionality. I sincerely doubt
that any Shader Model 3.0 game on the market had to resort to using tex2Dlod
because it required texture lookups in dynamic loops...

Anyhow, there might be a way around the HLSL limitation to get the same
undefined behavior as GLSL, by manipulating the shader binary. As far as I
know, assembly shaders don't impose any limits on the use of instructions
which compute gradients. So you could use tex2Dlod (texldl) everywhere, and
then after the shader binary has been generated, change them back to regular
texld instructions. The binary format is well defined:
http://msdn.microsoft.com/en-us/library/ff552891(v=VS.85).aspx. If you can
get this to work without drawbacks, we might consider exposing it as an
extension. It has a high hackyness level though, so no guarantees.
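
A rough sketch of that binary patch, assuming a shader model 2.0+ Direct3D 9 token stream in which the low 16 bits of an instruction token are the opcode and bits 24-27 give the number of tokens that follow it. The opcode values are the ones d3d9types.h defines for D3DSIO_TEX, D3DSIO_TEXLDL, D3DSIO_COMMENT and D3DSIO_END; the function name is made up, and nothing here has been checked against real drivers:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Opcode values from D3DSHADER_INSTRUCTION_OPCODE_TYPE in d3d9types.h.
constexpr std::uint32_t kOpTex     = 66;      // D3DSIO_TEX (texld)
constexpr std::uint32_t kOpTexLdl  = 95;      // D3DSIO_TEXLDL (texldl)
constexpr std::uint32_t kOpComment = 0xFFFE;  // D3DSIO_COMMENT
constexpr std::uint32_t kOpEnd     = 0xFFFF;  // D3DSIO_END

// Rewrites every texldl opcode to texld in place and returns how many
// instructions were patched. Operand tokens and comment payloads are
// skipped so that data which merely looks like an opcode is left alone.
std::size_t patchTexldlToTexld(std::vector<std::uint32_t> &tokens)
{
    std::size_t patched = 0;
    std::size_t i = 1;  // tokens[0] is the version token
    while (i < tokens.size()) {
        const std::uint32_t token = tokens[i];
        const std::uint32_t opcode = token & 0xFFFFu;
        if (opcode == kOpEnd) {
            break;
        }
        if (opcode == kOpComment) {
            i += 1 + ((token >> 16) & 0x7FFFu);  // skip the comment payload
            continue;
        }
        if (opcode == kOpTexLdl) {
            tokens[i] = (token & ~0xFFFFu) | kOpTex;
            ++patched;
        }
        i += 1 + ((token >> 24) & 0xFu);  // skip this instruction's operands
    }
    return patched;
}
```

The patched shader would sample with undefined gradients inside dynamic loops, which is exactly the GLSL behavior described above, so any use of this would need testing on each target GPU.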

Cheers,

Nicolas

Aras Pranckevicius

Mar 31, 2011, 6:24:56 AM

to angleproject

> That said, the reason HLSL does not allow functions which compute
> screen-space gradients inside dynamic loops is because these gradients would
> be unpredictable when neighboring pixels take different branches.

Yeah. And knowing how GPUs work, I would find it very surprising if this behavior weren't the same everywhere. Inside a dynamic branch the derivatives (and hence mipmapping) are out of the window, so the HLSL restrictions make perfect sense.

So does the suggestion of translating texture2D to tex2Dlod with zero mip level when inside a dynamic branch/loop that can't be predicated/unrolled.

> I sincerely doubt that any Shader Model 3.0 game on the market had to resort to using tex2Dlod
> because it required texture lookups in dynamic loops...

I could say that a lot of SM3.0 games use tex2Dlod, mostly in postprocessing effects where dynamic branches or large loops are used and which always sample a screen-space buffer anyway. All the shaders of that kind that I've seen use tex2Dlod because HLSL just fails to compile them otherwise... don't underestimate the complexity of modern postprocessing shaders ;)

--
Aras Pranckevičius
work: http://unity3d.com
home: http://aras-p.info

John Davis

Mar 31, 2011, 8:06:59 AM

to nic...@transgaming.com, ase...@vicomtech.org, angleproject

> I sincerely doubt that any Shader Model 3.0 game on the market had to resort to using tex2Dlod because it required texture lookups in dynamic loops...

That's a pretty broad/sweeping statement. And I believe inaccurate as well ...

http://http.developer.nvidia.com/GPUGems/gpugems_ch05.html

see section 5.5 - Good Fake Noise

So I use the above technique to create noise, then using a loop I combine multiple octaves to create turbulence. Please don't tell me game makers aren't doing this. This sort of technique is also an excellent way to dynamically generate textures rather than have WebGL download 100MB of textures through the internet.

JD

Daniel Koch

Mar 31, 2011, 10:39:37 AM

to ase...@vicomtech.org, angleproject

On 2011-03-30, at 6:21 PM, Alvaro Segura wrote:

> You are right. That's why we propose to find a way to apply a smart
> selective change.
>
> Let me state the problem in a different way:
>
> We are not talking about saving several seconds as in the flight.html
> demo. In fact, we faced this problem more severely. Our application,
> with moderate iteration settings takes more than 1 minute trying to
> compile (with a few "your script is taking too long" alerts) and
> finally fails. I don't have the numbers here, I think we have to hit
> "stop script" after minutes.

What happens if you reduce the optimization level?

Currently Chrome builds of ANGLE use D3DCOMPILE_OPTIMIZATION_LEVEL0, but stand-alone builds will default to D3DCOMPILE_OPTIMIZATION_LEVEL3.

If you are doing your own ANGLE build try modifying this at:

http://code.google.com/p/angleproject/source/browse/trunk/src/libGLESv2/Program.cpp#19

> Our fix makes compilation possible, and in only a couple seconds, the
> same as native GLSL. Rendering itself is acceptably fast after that
> for such a shader.
>
> So, yes, a rule has to be chosen to apply the tex2Dlod() trick only
> when necessary that does not break the rest of cases. Some
> suggestions:
>
> - texture reads inside loops
> - texture reads inside loops longer than N iterations (N=10?)
> - texture reads inside loops when the texture has a MIN_FILTER=LINEAR
> or NEAREST, ...
>
> That last idea might be the safest. Is it possible in Angle to know
> the texture filtering mode? (texParameteri TEXTURE_MIN_FILTER) If the
> mode for the relevant texture is LINEAR or NEAREST (not _MIPMAP_) then
> using tex2Dlod is just fine, right?

The libGLESv2 part of ANGLE knows the filtering mode, but the shader compiler currently has no knowledge of it.

Adding something like this option would introduce a state dependency between the shaders and textures/texture state.

This would require checking these dependencies on every draw call or whenever a texture's state changed, and potentially recompiling the shader (resulting in even more delays!). Introducing state dependencies into the shaders is definitely something we hope to avoid.
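
To make the cost of such a state dependency concrete, here is a small sketch under the assumption that compiled variants are cached by shader source plus a bitmask of which samplers currently use a mipmapped MIN_FILTER. `VariantCache` is a hypothetical class, and the "compile" is only a counter standing in for the expensive HLSL compilation:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical cache of compiled shader variants. The key combines the
// source with a per-sampler bitmask (bit i set = sampler i is mipmapped).
class VariantCache {
public:
    // Returns the blob for (source, mipmapMask), "compiling" on a miss.
    const std::string &get(const std::string &source, unsigned mipmapMask)
    {
        const auto key = std::make_pair(source, mipmapMask);
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            ++compileCount_;  // stands in for an expensive HLSL compile
            it = cache_.emplace(key, "blob#" + std::to_string(compileCount_)).first;
        }
        return it->second;
    }

    std::size_t compileCount() const { return compileCount_; }

private:
    std::map<std::pair<std::string, unsigned>, std::string> cache_;
    std::size_t compileCount_ = 0;
};
```

Every draw call would have to rebuild the mask and look it up, and a single texParameteri change on a bound texture can trigger a fresh compile of an otherwise unchanged shader.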

John Davis

Apr 9, 2011, 9:42:51 PM

to dan...@transgaming.com, ase...@vicomtech.org, angleproject, public webgl

Did we ever find a solution for texture lookups in loops causing slow compile times?

Kenneth Russell

Apr 11, 2011, 4:45:07 PM

to nea...@gmail.com, Alvaro Segura, angleproject

On Thu, Mar 31, 2011 at 3:24 AM, Aras Pranckevicius <nea...@gmail.com> wrote:
>> That said, the reason HLSL does not allow functions which compute
>> screen-space gradients inside dynamic loops is because these gradients
>> would
>> be unpredictable when neighboring pixels take different branches.
>
> Yeah. And knowing how GPUs work, I would find it very surprising if this
> behavior weren't the same everywhere. Inside a dynamic branch the
> derivatives (and hence mipmapping) are out of the window, so the HLSL
> restrictions make perfect sense.
> So does the suggestion of translating texture2D to tex2Dlod with zero mip
> level when inside a dynamic branch/loop that can't be predicated/unrolled.

This sounds like the best approach to me, and given the severity of
the problem, I think this is a workaround we should implement in ANGLE
as soon as possible.

Alvaro, would it be possible for you to provide a patch for your
changes to ANGLE implementing this alternate translation? It would be
ideal if you could produce and upload your patch using gclient / gcl
so that it can be easily reviewed.

Thanks,

-Ken

Alastair Patrick

Apr 11, 2011, 5:05:37 PM

to k...@google.com, Kenneth Russell, nea...@gmail.com, Alvaro Segura, angleproject

I suggest doing what Daniel suggested first. In Program.cpp, you can change the shader optimization level to minimal by replacing D3DCOMPILE_OPTIMIZATION_LEVEL3 with D3DCOMPILE_OPTIMIZATION_LEVEL0. This is what Chrome does, and it largely solved a shader compilation time issue involving nested loops. Unless the ANGLE_COMPILE_OPTIMIZATION_LEVEL macro is set to something else, the default is maximum optimization, which can be very slow.

Al

Alvaro Segura

Apr 11, 2011, 6:25:06 PM

to Kenneth Russell, angleproject

Hello,

Tomorrow we'll send this patch. It's a very simple change, and quite a radical fix, BTW, we only did it to test its feasibility. As we discussed before, it is not in itself a definitive solution, lacking the appropriate "conditional" application. We did not go as far as doing such smart compilation.

You may have seen this demo we recently posted in the WebGL list: http://demos.vicomtech.org/volren/ It exhibits these issues though not very severely (the number of iterations is slightly reduced to avoid them). We have similar but more complex applications that just fail to compile in ANGLE, but work with our custom build.

Let's consider that our proposed change disables mipmapping in these cases. Is that a problem? Not really. Shaders of this kind, which take repeated samples of textures in long loops at varying UV coordinates, are not using textures in the classical "texture mapped surface" way. They use textures as *data sources* that they need to read (particularly continuous data that benefits from hardware interpolation). So the lack of mipmapping here is not only not a problem, it is probably a necessity. That is why I tend to think that, in principle, the replacement of tex2D with tex2Dlod inside loops (longer than N) should not be much of a problem.

I can think of another case of *repeated texture sampling in different points* in "parallax mapping", but as far as I know that is usually hand-unrolled to just one or two iterations.

BTW, I'm now thinking: wouldn't there be a way to *hint* the compiler to favor some code generation or another? Maybe using #pragmas to instruct ANGLE to use one texture sampling instruction (that is not directly allowed in GLSL)? ...
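
One way such a hint could look, purely as a sketch: the pragma name `texture_lod_zero` is invented here and is not defined by GLSL ES or by ANGLE. Since GLSL implementations ignore pragmas they do not recognize, a shader carrying it should still compile elsewhere. The function only detects the hint; the actual rewrite would happen in the translator:

```cpp
#include <sstream>
#include <string>

// Hypothetical hint syntax; neither GLSL ES nor ANGLE defines this pragma.
const char kLodHintOn[]  = "#pragma texture_lod_zero(on)";
const char kLodHintOff[] = "#pragma texture_lod_zero(off)";

// Scans the GLSL source line by line and returns true if the last hint
// found enables the texture2D -> tex2Dlod rewrite. Default is disabled.
bool wantsLodZeroSampling(const std::string &glslSource)
{
    std::istringstream lines(glslSource);
    std::string line;
    bool enabled = false;
    while (std::getline(lines, line)) {
        // rfind(prefix, 0) == 0 is a "starts with" test.
        if (line.rfind(kLodHintOn, 0) == 0) {
            enabled = true;
        } else if (line.rfind(kLodHintOff, 0) == 0) {
            enabled = false;
        }
    }
    return enabled;
}
```

This version treats the hint as a whole-shader switch; scoping it to the code between an "on" and an "off" line would need support in the parser itself.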

Best regards,

Alvaro

Alvaro Segura

Apr 11, 2011, 6:32:27 PM

to Kenneth Russell, angleproject

I forgot to say:

We did not get to recompile ANGLE with the suggested settings, but if that is what Chrome 12 uses they seem to be beneficial in some cases. The long compile times of long loops (without intensive texture reads) did become faster in Chrome 12 (not instantaneous, but quite fast). For the more complex shaders with a lot of texture reads there was no significant improvement.

We might give that a try anyway.

John Davis

Apr 11, 2011, 9:01:14 PM

to apat...@google.com, k...@google.com, Kenneth Russell, nea...@gmail.com, Alvaro Segura, angleproject

What version of Chrome has this change?

John Davis

Apr 11, 2011, 9:04:41 PM

to ase...@vicomtech.org, Kenneth Russell, angleproject

I think the #pragma approach would be much safer.

John Davis

Apr 11, 2011, 9:11:11 PM

to ase...@vicomtech.org, Kenneth Russell, angleproject

A friend of mine who worked with LucasArts on the Xbox Star Wars title told me that within their build process the shaders took ~12hrs to compile/link. I don't think there is any getting away from the fact that shaders with significant complexity will take a while to compile.

Eventually we're going to realize we need to surface the compiling to JavaScript as an async pattern, much like the XMLHttpRequest onload callback.

Another thing that would be hugely advantageous is a way to cache the binary shaders so they don't have to be recompiled if they haven't changed.

Right now this may seem like a non-issue, but as serious game programmers jump on board, this will become a bigger problem.

Alastair Patrick

Apr 11, 2011, 9:13:34 PM

to unic...@gmail.com, ase...@vicomtech.org, Kenneth Russell, angleproject

Switching to minimal optimization level was Chromium r79191, which I think is only in m12.

John Davis

Apr 11, 2011, 9:20:05 PM

to Alastair Patrick, ase...@vicomtech.org, Kenneth Russell, angleproject

The minimal optimization is a great intermediate fix, and really simple, any chance we could push it out in Chrome 11?

John Bauman

Apr 11, 2011, 11:29:18 PM

to unic...@gmail.com, Alastair Patrick, ase...@vicomtech.org, Kenneth Russell, angleproject

Has anyone tried using tex2Dgrad with ddx/ddy to allow the compiler to do real branching? It works for some cases, although unfortunately the shader compiler seems reluctant to move the ddx out of the loop, so sometimes it avoids branching for no good reason. It looks like solving that would require our own loop invariant code motion on the ddx and ddy and values they depend on.
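
For readers unfamiliar with the idea, a sketch of what the emitted HLSL might look like, written in the same string-emitting style as OutputHLSL.cpp. The helper name and all identifiers in the generated HLSL (`uv`, `stepUV`, `sum`) are illustrative assumptions; the derivatives of `uv` are only valid for the offset coordinates if `stepUV` is uniform across pixels:

```cpp
#include <string>

// Emits an HLSL sampling loop in which the screen-space derivatives are
// taken once, before the loop, and handed to tex2Dgrad. The loop body then
// contains no gradient instruction, so the compiler is free to branch.
std::string emitGradientHoistedLoop(const std::string &sampler, int iterations)
{
    std::string hlsl;
    hlsl += "float2 dx = ddx(uv);\n";
    hlsl += "float2 dy = ddy(uv);\n";
    hlsl += "float4 sum = 0;\n";
    hlsl += "[loop] for (int i = 0; i < " + std::to_string(iterations) + "; i++)\n";
    hlsl += "{\n";
    hlsl += "    sum += tex2Dgrad(" + sampler + ", uv + i * stepUV, dx, dy);\n";
    hlsl += "}\n";
    return hlsl;
}
```

Doing this automatically is the hard part mentioned above: the translator would have to prove that the derivative inputs are loop-invariant before moving ddx/ddy out of the loop.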

Alvaro Segura

Apr 14, 2011, 4:46:35 PM

to Kenneth Russell, angleproject

Hello again,

Below is the patch with the small changes to ANGLE that make it compile complex shaders quickly. Basically it changes the optimization level from 3 to 0, and tex2D to tex2Dlod (warning: it uses tex2Dlod *always*, so a condition must be added as we said earlier).

We also mentioned earlier that some more complex shaders fail to compile after more than one minute trying under stock ANGLE. An example of such a case can be seen in the video at the bottom of this page:

http://demos.vicomtech.org/volren/

(Sorry, video only, no live demo)

The video shows a volume rendering display of reflectivity data collected by a weather radar. The added complexity with respect to the medical dataset above is caused by the way the volume is sampled: as the radar rotates at increasing elevations, it samples the space in spherical coordinates, and elevation angles increase in non-uniform steps. That adds operations to sample the data during rendering. Also the transfer function used to calculate final pixel values can be configured interactively with the sliders at the right. And new data from different points in time can be selected with the prev/nextData buttons.

Best regards

Index: OutputHLSL.cpp
===================================================================
--- OutputHLSL.cpp    (revision 611)
+++ OutputHLSL.cpp    (working copy)
@@ -214,7 +214,9 @@
         {
             out << "float4 gl_texture2D(sampler2D s, float2 t)\n"
                    "{\n"
-                   "    return tex2D(s, float2(t.x, 1 - t.y));\n"
+                   "    return tex2Dlod(s, float4(t.x, 1 - t.y, 0, 0));\n"
                    "}\n"
                    "\n";
         }

Index: Program.cpp
===================================================================
--- Program.cpp    (revision 611)
+++ Program.cpp    (working copy)
@@ -16,7 +16,7 @@
 #include "libGLESv2/utilities.h"
 
 #if !defined(ANGLE_COMPILE_OPTIMIZATION_LEVEL)
-#define ANGLE_COMPILE_OPTIMIZATION_LEVEL D3DCOMPILE_OPTIMIZATION_LEVEL3
+#define ANGLE_COMPILE_OPTIMIZATION_LEVEL D3DCOMPILE_OPTIMIZATION_LEVEL0
 #endif
 
 namespace gl
