JIT Optimizations Lacking!

Nathan Zaugg

Aug 18, 2008, 3:53:02 PM

I have been doing some research on managed code vs. native code. In my
first battery of tests I have found that the optimization performed
during JIT compilation is not nearly as good as the optimization done by
the C++ compiler at compile time. Some of these optimizations can make a
huge difference in performance! My testing also indicates that ngen does
little (if anything) differently.

Here are a couple of examples I have found so far:
C# Code:

static void Main(string[] args) {
    Stopwatch sw = Stopwatch.StartNew();

    int i = 123456789;

    // Do Work!
    for (int j = 0; j < 1000000000; j++) {
        i = (i << 2) + 1525885;
    }

    sw.Stop();

    // Must use the result to prevent it from being optimized out
    Console.WriteLine("Result: " + i);
    Console.WriteLine("Elapsed: " + sw.Elapsed.TotalMilliseconds);
}

C++ Code:

#include "stdafx.h"
#include "time.h"
#include <windows.h>
#include <iostream>
#include <string>

using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    //Stopwatch
    SYSTEMTIME st;
    GetSystemTime( &st );

    long ms = st.wMilliseconds + (st.wSecond * 1000) + (st.wMinute * 1000 * 60);

    int i = 123456789;

    // Do Work!
    for(int j=0; j<1000000000; j++) {
        i = (i << 2) + 1525885;
    }

    GetSystemTime(&st);
    long ms2 = st.wMilliseconds + (st.wSecond * 1000) + (st.wMinute * 1000 * 60);

    // Must use the result to prevent it from being optimized out
    cout << "Result: " << i;

    long res = ms2 - ms;
    cout << "Elapsed Milliseconds: " << res << endl;

    return 0;
}
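
A side note on the measurement itself: the GetSystemTime arithmetic above
leaves out the hour field, so the difference goes negative if a run crosses
the top of an hour, and it reads the wall clock rather than a monotonic
timer. A minimal portable sketch of the same measurement follows. The names
time_ms and shift_add_loop are made up for illustration, and the loop uses
unsigned arithmetic because overflowing a signed int is, strictly speaking,
undefined in C++.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not from the thread): runs `work` once and returns
// the elapsed time in milliseconds, measured with a monotonic clock.
template <typename Work>
double time_ms(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// The loop from the post, with unsigned arithmetic so that the 32-bit
// wraparound is well defined.
std::uint32_t shift_add_loop(std::uint32_t iterations) {
    std::uint32_t i = 123456789u;
    for (std::uint32_t j = 0; j < iterations; ++j) {
        i = (i << 2) + 1525885u;
    }
    return i;
}
```

It could be used as: std::uint32_t r = 0; double ms = time_ms([&] { r =
shift_add_loop(1000000000u); }); and then printing both r and ms so the
result stays live.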

C# Disassembly:

int i = 123456789;
0000005a mov dword ptr [ebp-44h],75BCD15h

// Do Work!
for (int j = 0; j < 1000000000; j++) {
00000061 xor edx,edx
00000063 mov dword ptr [ebp-48h],edx
00000066 nop
00000067 jmp 0000007B
00000069 nop
i = (i << 2) + 1525885;
0000006a mov eax,dword ptr [ebp-44h]
0000006d lea eax,[eax*4+0017487Dh]
00000074 mov dword ptr [ebp-44h],eax
}
00000077 nop
for (int j = 0; j < 1000000000; j++) {
00000078 inc dword ptr [ebp-48h]
0000007b cmp dword ptr [ebp-48h],3B9ACA00h
00000082 setl al
00000085 movzx eax,al
00000088 mov dword ptr [ebp-50h],eax
0000008b cmp dword ptr [ebp-50h],0
0000008f jne 00000069

C++ Disassembly:

int i = 123456789;
01251513 mov dword ptr [i],75BCD15h

// Do Work!
for(int j=0; j<1000000000; j++) {
0125151A mov dword ptr [j],0
01251521 jmp wmain+6Ch (125152Ch)
01251523 mov eax,dword ptr [j]
01251526 add eax,1
01251529 mov dword ptr [j],eax
0125152C cmp dword ptr [j],3B9ACA00h
01251533 jge wmain+84h (1251544h)
i = (i << 2) + 1525885;
01251535 mov eax,dword ptr [i]
01251538 lea ecx,[eax*4+17487Dh]
0125153F mov dword ptr [i],ecx
}
01251542 jmp wmain+63h (1251523h)

Notice that the assembly generated by C++ is far more optimized. It runs in
about 1/4 the time!

There is also a bug in the optimization engine. For example, if we remove
the statements that output the result (i), the C++ compiler removes the for
loop entirely and the program returns in 0 ms. The C# version removes the
"work" aspect (it doesn't perform any calculation on i) but still runs
through the for loop! (Shown below, built in release mode with optimizations
and without debug symbols.)

C# in release mode optimized out i, but not the loop.

009B00B7 inc eax
009B00B8 cmp eax,3B9ACA00h
009B00BD jl 009B00B7

Are there any plans in the works to improve the JIT compilation? I
understand that the goal is to have the compilation complete as fast as
possible, but code execution speed vs. JIT speed -- give me a choice! If I
would rather wait an extra second or two for more optimized output, then I
should have that prerogative!

Thanks!
Nathan Zaugg

Barry Kelly

Aug 18, 2008, 4:37:44 PM

Nathan Zaugg <Nathan Za...@discussions.microsoft.com> wrote:

> I have been doing some research on managed code vs. native code. In my
> first battery of tests I have realized that the optimization
> performed during the JIT compilation is not nearly as good as the
> optimizations done by C++ at compile time.

> Here are a couple of examples I have found so far:
> C# Code:

Your benchmark does no meaningful work (i.e. it is unrepresentative of
the vast majority of programs). You can't conclude much of consequence
from it.

> static void Main(string[] args) {
>
> Stopwatch sw = Stopwatch.StartNew();
>
> int i = 123456789;
>
> // Do Work!
> for (int j = 0; j < 1000000000; j++) {
> i = (i << 2) + 1525885;
> }
> sw.Stop();
>
> // Must use the result to prevent it from being optimized out
> Console.WriteLine("Result: " + i);
> Console.WriteLine("Elapsed: " + sw.Elapsed.TotalMilliseconds);

> There is also a bug in the optimization engine. For example if we remove
> the statements that output the result (i)
> the C++ code completely removes all of the for loop and returns in 0ms. The
> C# version removes the "work" aspect

C++ has a preprocessor, and some of the idioms of preprocessor usage
*intentionally* generate code which does no work - i.e. an if-statement
with a condition that is statically false, or a while-loop that is never
entered, etc. These idioms are frequently used for selecting between
debugging builds and optimized release builds, or light logging versus
verbose logging, for example.

On the other hand, C# does not have such a strong tradition of
generating pointless code, as it has other tools at its disposal (JIT in
debug mode versus normal; runtime loading of classes & configuration
rather than compile-time configuration).

So, what you are observing is the result of the developers behind the
CLR JIT and MSVC putting work in where it is likely to have most effect,
rather than trying to optimize useless code.

> Are there any plans in the works to improve the JIT compilation?

Every software company interested in selling the next iteration of its
technology is interested in improving performance; I'm sure the CLR team
is no different in this respect.

Also, bear in mind that the MSVC team have had at least a decade longer
to work on optimizations, and haven't had to live under the constraints
of the JIT.

-- Barry

--
http://barrkel.blogspot.com/

Thomas Scheidegger

Aug 18, 2008, 5:14:27 PM

Hi Nathan

> ngen does little (if anything) different

The current ngen was designed just to do pre-compilation,
with code optimization only at the level of the JIT
(subject to any changes in 2.0 SP2).
Thus it eliminates the runtime 'delays' for JITing,
but it will not make any (remarkable) difference in native code performance:
http://blogs.msdn.com/clrcodegeneration/archive/2007/09/15/to-ngen-or-not-to-ngen.aspx

> 00000066 nop

this looks like code generated by the debug-mode JITer.
Thus it keeps native code as _close_ to MSIL as possible!
Of course, this code is in no way comparable to code generated by C/C++.

http://www.codeproject.com/KB/dotnet/JITOptimizations.aspx

http://msdn.microsoft.com/en-us/library/ms973852.aspx

And there could even be a common future
for the Visual C++ compiler (backend) and .NET JITer:
http://research.microsoft.com/Phoenix/

https://connect.microsoft.com/content/content.aspx?ContentID=4527&SiteID=214

--
Thomas Scheidegger - 'NETMaster'
http://dnetmaster.net/

Barry Kelly

Aug 18, 2008, 5:39:28 PM

Thomas Scheidegger wrote:

> > 00000066 nop
>
> this looks like code generated by the debug-mode JITer.

That's a good point, I didn't look past the sample code and described
behaviour.

For good, non-intrusive discovery of the actual generated code, I compile
using csc /optimize+ and disassemble while debugging with WinDbg, using
the SOS extension and its !u command.

Nathan Zaugg

Aug 19, 2008, 1:18:01 AM

"Barry Kelly" wrote:
> Your benchmark does no meaningful work (i.e. it is unrepresentative of
> the vast majority of programs). You can't conclude much of consequence
> from it.

I was going for a very simple computational example. It should be easy to
argue that the performance between the two compilers should be more or less
equivalent. My point is simply that it is not. I've been working on an
implementation of LZ78 and find it painfully un-optimized. Since there is a
lot of bit shifting happening on my byte-aligned stream this is a great test.
Writing the critical pieces in IL doesn't speed things up any because it's
the JIT compilation that emits the unsatisfactory code.

> C++ has a preprocessor, and some of the idioms of preprocessor usage
> *intentionally* generate code which does no work - i.e. an if-statement
> with a condition that is statically false, or a while-loop that is never
> entered, etc. These idioms are frequently used for selecting between
> debugging builds and optimized release builds, or light logging versus
> verbose logging, for example.
>
> On the other hand, C# does not have such a strong tradition of
> generating pointless code, as it has other tools at its disposal (JIT in
> debug mode versus normal; runtime loading of classes & configuration
> rather than compile-time configuration).
>
> So, what you are observing is the result of the developers behind the
> CLR JIT and MSVC putting work in where it is likely to have most effect,
> rather than trying to optimize useless code.

True, I would expect the C++ compiler to win in most computational tests but
not by a factor of 4! It doesn't matter what mode I build in (release or
debug) I wouldn’t have expected that much of a difference. The frustrating
thing is that I can't tune this code any further -- I have to leave that to
the compiler.

My suggestion is simply that maybe we need a way to compile our C# code a
little more carefully. Perhaps if there was a highly-optimized ngen?

Pavel Minaev

Aug 19, 2008, 3:42:40 AM

On Aug 19, 9:18 am, Nathan Zaugg <NathanZa...@discussions.microsoft.com> wrote:
> True, I would expect the C++ compiler to win in most computational tests but
> not by a factor of 4! It doesn't matter what mode I build in (release or
> debug) I wouldn’t have expected that much of a difference. The frustrating
> thing is that I can't tune this code any further -- I have to leave that to
> the compiler.

The build mode doesn't matter here - if you attach to an executable
built in "Release" mode using the VS debugger (I'm assuming that's how
you've got the disassembly), then the JIT itself automatically enters
its own "debug mode", disabling most optimizations. Please do read the
links provided in the earlier posts in this thread for explanations of
this behavior, and for how to take correct measurements.

Barry Kelly

Aug 19, 2008, 7:50:26 AM

Nathan Zaugg wrote:

> True, I would expect the C++ compiler to win in most computational tests but
> not by a factor of 4!

It doesn't win by a factor of 4. It wins by a factor of about 1.4. The
results of my tests follow - some of them are interesting. In
particular, GNU C calculates the result of 'i' after the loop at compile
time, so the code runs rather quickly.
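
A sketch of why a compiler is able to fold this loop into a constant (this
is not a claim about what GCC does internally): each iteration shifts i left
by two bits, so after 16 iterations every bit of the starting value has been
shifted out of a 32-bit register and the value stops changing. The function
names step and run are hypothetical, and unsigned arithmetic is used so that
the wraparound is well defined.

```cpp
#include <cstdint>

// One step of the benchmark loop in wrapping 32-bit arithmetic.
constexpr std::uint32_t step(std::uint32_t i) {
    return (i << 2) + 1525885u;
}

// Runs the loop for n iterations (needs C++14 for the constexpr loop).
constexpr std::uint32_t run(std::uint32_t seed, std::uint32_t n) {
    std::uint32_t i = seed;
    for (std::uint32_t j = 0; j < n; ++j) {
        i = step(i);
    }
    return i;
}

// All three checks are evaluated by the compiler, not at run time.
static_assert(run(123456789u, 16) == 1431147137u,
              "16 iterations already give the value printed in the thread");
static_assert(run(0u, 16) == run(123456789u, 16),
              "the seed has been shifted out completely");
static_assert(step(1431147137u) == 1431147137u,
              "the value is a fixed point, so further iterations change nothing");
```

So the remaining 999,999,984 iterations do no observable work, which is part
of what makes this loop a poor stand-in for real code.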

My machine is Q6600 running at stock 2.4 GHz.

Using 'csc /optimize+' with csc ver. 3.5.30729.1 / CLR ver.
2.0.50727.3053:

---8<---
using System;
using System.Diagnostics;

class App
{
    static void Main()
    {
        Test();
        Test();
        Test();
    }

    static void Test()
    {
        Stopwatch sw = Stopwatch.StartNew();

        int i = 123456789;

        // Do Work!
        for (int j = 0; j < 1000000000; j++)
        {
            i = (i << 2) + 1525885;
        }
        sw.Stop();

        // Must use the result to prevent it from being optimized out
        Console.WriteLine("Result: " + i);
        Console.WriteLine("Elapsed: " + sw.Elapsed.TotalMilliseconds);
    }
}

--->8---

Here is the C++ I used, using 'cl /O2' with cl ver. 15.00.30729.01:

---8<---
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>   // for exit()

static void Test()
{
    LARGE_INTEGER freq;
    if (!QueryPerformanceFrequency(&freq))
        exit(1);

    LARGE_INTEGER start;
    QueryPerformanceCounter(&start);

    int i = 123456789;

    // Do Work!
    for(int j=0; j<1000000000; j++) {
        i = (i << 2) + 1525885;
    }

    LARGE_INTEGER end;
    QueryPerformanceCounter(&end);

    printf("Result: %d\n", i);

    printf("Elapsed milliseconds: %.0f\n",
        1000 * (end.QuadPart - start.QuadPart)
        / (double) freq.QuadPart);
}

int main()
{
    Test();
    return 0;
}
--->8---

The output from the C# version is as follows:

Result: 1431147137
Elapsed: 608.3645
Result: 1431147137
Elapsed: 610.7727
Result: 1431147137
Elapsed: 607.7868

The output from the C++ version is as follows:

Result: 1431147137
Elapsed milliseconds: 420

For good measure, I converted the C# into Java, and compiled and ran it
with JDK 1.5.0_05 and server VM. The output was:

Result: 1431147137
Elapsed: 456.072906 ms
Result: 1431147137
Elapsed: 456.225851 ms
Result: 1431147137
Elapsed: 505.085121 ms

... which seems more competitive with the C++ version. The exact Java
code I used is at the bottom.

Using 'g++ -O3 -mno-cygwin' for Cygwin g++ (GCC) 3.4.4:

Result: 1431147137
Elapsed milliseconds: 17

---8<---
public class Simple2
{
    public static void main(String[] args) {
        test();
        test();
        test();
    }

    static void test() {
        long start = System.nanoTime();

        int i = 123456789;
        // Do Work!
        for (int j = 0; j < 1000000000; j++) {
            i = (i << 2) + 1525885;
        }

        long end = System.nanoTime();

        // Must use the result to prevent it from being optimized out
        System.out.printf("Result: %d\n", i);

        System.out.printf("Elapsed: %f ms\n",
            (end - start) / (double) 1000000);
    }
}
--->8---

Nathan Zaugg

Aug 19, 2008, 12:09:03 PM

Ouch, the Java VM outperformed the CLR?

You were right -- the simple act of attaching a debugger to CLR code forces
it to not optimize. One heck of a gotcha! Though in my opinion there still
isn't any good reason for C++ to outperform C# by a statistically significant
margin under these conditions. In theory C# should be almost as fast or
faster at just about everything (especially memory management).

I'm writing this up for my blog for two reasons: I've hit up against some
performance issues (as mentioned before), and I wanted to know how much
performance I'd be missing out on by using C# over C++.

http://interactiveasp.net/blogs/natesstuff/archive/2008/08/11/managed-vs-unmanaged-round-1-theoretical.aspx

Thanks for the help!
Nathan Zaugg

Jon Skeet [C# MVP]

Aug 19, 2008, 2:12:04 PM

Nathan Zaugg <Natha...@discussions.microsoft.com> wrote:
> Ouch, the Java VM outperformed the CLR?

That happens in some cases. In other cases it's the other way round. It
shouldn't come as much surprise, as the JVM has had a lot of work put
into it for rather longer than the CLR.

--
Jon Skeet - <sk...@pobox.com>
Web site: http://www.pobox.com/~skeet
Blog: http://www.msmvps.com/jon.skeet
C# in Depth: http://csharpindepth.com

Pavel Minaev

Aug 20, 2008, 1:39:39 AM

On Aug 19, 8:09 pm, Nathan Zaugg <NathanZa...@discussions.microsoft.com> wrote:
> Ouch, the Java VM outperformed the CLR?

To the best of my knowledge, this has always been the case. Java
HotSpot is extremely aggressive in its optimizations, going so far as
stack-allocating objects when it can get away with it. However, the
JIT phase is slower as a result, which is why HotSpot does not JIT-compile
every method that it runs: on first execution the bytecode is
interpreted directly, and a method is JITted only once it has been hit
repeatedly.

In contrast, .NET does not have a bytecode interpreter at all, so
everything has to be compiled. For this reason, the .NET JIT is optimized
more for its own speed, and less for the performance of its output.
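
To make the "interpret first, compile when hot" idea concrete, here is a toy
model of the call-counting part only. Everything in it is hypothetical: the
class name, the threshold, and the two plain C++ functions standing in for
an interpreter and for JIT-compiled code. It says nothing about HotSpot's
real heuristics.

```cpp
#include <cstdint>

// Toy model: run the slow tier while the method is cold, switch to the
// fast tier once it has been called `hot_threshold` times.
class TieredMethod {
public:
    using Fn = std::int64_t (*)(std::int64_t);

    TieredMethod(Fn slow_tier, Fn fast_tier, std::uint32_t hot_threshold)
        : slow_(slow_tier), fast_(fast_tier), threshold_(hot_threshold) {}

    std::int64_t call(std::int64_t arg) {
        if (calls_ < threshold_) {
            ++calls_;          // still cold: no compile cost is paid
            return slow_(arg);
        }
        promoted_ = true;      // hot: a real VM would compile here, once
        return fast_(arg);
    }

    bool promoted() const { return promoted_; }

private:
    Fn slow_;
    Fn fast_;
    std::uint32_t threshold_;
    std::uint32_t calls_ = 0;
    bool promoted_ = false;
};

// Stand-in for interpreted code: squares x by repeated addition.
inline std::int64_t square_interpreted(std::int64_t x) {
    const std::int64_t a = x < 0 ? -x : x;
    std::int64_t sum = 0;
    for (std::int64_t k = 0; k < a; ++k) {
        sum += a;
    }
    return sum;
}

// Stand-in for compiled code: same result, computed directly.
inline std::int64_t square_compiled(std::int64_t x) {
    return x * x;
}
```

A method that runs only a handful of times never pays for compilation in
this model, whereas a design without an interpreter has to compile every
method before its first call, which is the trade-off described above.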

On the other hand, in my experience, the .NET GC has shown noticeably better
performance than the Java one, and this, combined with the existence of value
types in the language and their use by the standard libraries, could
well be enough to offset the JIT speed difference in any moderately
large program.

Also, 3.5 SP1 has quite a few optimizations in the runtime, including
JIT (the big one is that they've finally implemented inlining of
methods that take structs as arguments!), so it would be interesting
to look at SP1 specifically.

Thomas Scheidegger

Aug 20, 2008, 2:12:49 AM

> 3.5 SP1 has quite a few optimizations in the runtime, including
> JIT (the big one is that they've finally implemented inlining of
> methods that take structs as arguments!), so it would be interesting
> to look at SP1 specifically.

Vance Morrison's Weblog
To Inline or not to Inline: That is the question
<URL:http://blogs.msdn.com/vancem/archive/2008/08/19/to-inline-or-not-to-inline-that-is-the-question.aspx>

Pavel Minaev

Aug 20, 2008, 2:26:47 AM

On Aug 19, 3:50 pm, Barry Kelly <barry.j.ke...@gmail.com> wrote:
> It doesn't win by a factor of 4. It wins by a factor of about 1.4.

I've disassembled both versions, and here's what it boils down to. The
Visual C++ version of the loop is:

mov esi, 123456789
mov eax, 1000000000
$LL3@Test:
sub eax, 1
lea esi, DWORD PTR [esi*4+1525885]
jne SHORT $LL3@Test

So it reversed the loop to count down to zero, and uses the not-zero
check on the jump. And here's the .NET JIT (3.5 SP1):

mov edi,75BCD15h
xor edx,edx
jmp 00e300d1
00e300c9:
lea edi,[edi*4+17487Dh]
inc edx
00e300d1:
cmp edx,3B9ACA00h
setl al
movzx eax,al
test eax,eax
jne 00e300c9

On one hand, it did optimize the arithmetic to LEA. On the other
hand, it did not reverse the loop. Also, for whatever reason, it does
a lot of unnecessary conversions during comparison - where CMP/JL
would be perfectly sufficient, it does CMP/SETL/MOVZX/TEST/JNE. I
wonder what that's about - it would seem that basic comparisons should
be JITted better (especially when it's such an obvious thing to do,
and when any other compiler out there gets it right).

Pavel Minaev

Aug 20, 2008, 2:47:57 AM

On Aug 20, 10:26 am, Pavel Minaev <int...@gmail.com> wrote:
> On one hand, it did optimize the arithmetic to LEA. On the other
> hand, it did not reverse the loop. Also, for whatever reason, it does
> a lot of unnecessary conversions during comparison - where CMP/JL
> would be perfectly sufficient, it does CMP/SETL/MOVZX/TEST/JNE. I
> wonder what that's about - it would seem that basic comparisons should
> be JITted better (especially when it's such an obvious thing to do,
> and when any other compiler out there gets it right).

Okay, it was my mistake - I forgot to feed /optimize+ to csc.exe, so
it did not generate particularly good IL either; and, apparently, the
JIT assumes that IL is optimized in the first place, and does not try
to work around it. Here's what it is with /optimize:

0000001d mov edi,75BCD15h
00000022 xor eax,eax
00000024 lea edi,[edi*4+0017487Dh]
0000002b inc eax
0000002c cmp eax,3B9ACA00h
00000031 jl 00000024

The only reason why this is slower than C++ is that it counts up
rather than down, and therefore uses CMP/JL. If we manually rewrite
the loop to count down, like this:

for (int j = 1000000000; j != 0; --j)
{
    i = (i << 2) + 1525885;
}

then here's what the JIT makes of it:

0000001d mov edi,75BCD15h
00000022 mov eax,3B9ACA00h
00000027 lea edi,[edi*4+0017487Dh]
0000002e dec eax
0000002f jne 00000027

which is the same as the C++ version, and does indeed run at the same
speed.
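
The reversal is only legal because the loop body never reads j, so the
direction of counting cannot affect the result. A small C++ sketch of the
two forms, with hypothetical function names and unsigned arithmetic so that
the wraparound is well defined:

```cpp
#include <cstdint>

// Counts up: every iteration needs an explicit compare against the bound.
std::uint32_t loop_up(std::uint32_t n) {
    std::uint32_t i = 123456789u;
    for (std::uint32_t j = 0; j < n; ++j) {
        i = (i << 2) + 1525885u;
    }
    return i;
}

// Counts down: the decrement already sets the zero flag on x86, so the
// exit test can be a bare conditional jump, as in the listings above.
std::uint32_t loop_down(std::uint32_t n) {
    std::uint32_t i = 123456789u;
    for (std::uint32_t j = n; j != 0; --j) {
        i = (i << 2) + 1525885u;
    }
    return i;
}
```

Both functions run the body exactly n times and return the same value for
every n; a loop that indexed an array with j would need more care.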

Brian Gideon

Aug 22, 2008, 10:01:13 PM

On Aug 18, 2:53 pm, Nathan Zaugg <Nathan

Bit manipulation is one area where I would like to see improvement to
the JIT. For example, there are real-life scenarios where counting
the number of set bits in a field accounts for a significant percentage
of the CPU time. Different CPUs have different instructions for
performing that task, while others lack such an instruction completely and
must use a sequence of more primitive instructions inside a loop. It
would be nice if there were, say, a BitManipulator class whose methods
were optimized by the JIT for a specific CPU at runtime.
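
For reference, the software fallbacks look roughly like this, sketched in
C++ rather than C#. The function names are made up for illustration. The
first is the "primitive instructions inside a loop" approach; the second is
a well-known branch-free variant. C++20 also offers std::popcount in <bit>,
which a compiler may lower to a hardware instruction where one exists,
though whether it does depends on the compiler and the target.

```cpp
#include <cstdint>

// Kernighan's method: each pass clears the lowest set bit, so the loop
// runs once per set bit rather than once per bit position.
int popcount_loop(std::uint32_t x) {
    int count = 0;
    while (x != 0) {
        x &= x - 1;
        ++count;
    }
    return count;
}

// Branch-free version: sums bits in 2-bit, then 4-bit, then 8-bit fields,
// and finally adds the four byte counts together with one multiply.
int popcount_swar(std::uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return static_cast<int>((x * 0x01010101u) >> 24);
}
```

Neither version adapts to the CPU it runs on, which is the gap a
JIT-recognized method would fill.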

Pavel Minaev

Aug 25, 2008, 10:02:48 AM

On Aug 23, 6:01 am, Brian Gideon <briangid...@yahoo.com> wrote:
> Bit manipulation is one area where I would like to see improvement to
> the JIT. For example, there are real life scenarios where counting
> the number of set bits in a field accounts for a significant percentage
> of the CPU time. Different CPUs have different instructions for
> performing that task while others completely lack that instruction and
> must use a sequence of more primitive instructions inside a loop. It
> would be nice if there were, say, a BitManipulator class whose methods
> were optimized by the JIT for a specific CPU at runtime.

Well, BitArray has methods to count bits, but it's not efficient by
itself. BitVector32 is just a thin wrapper around int, but it doesn't
have bit counters. It would sure be nice if BitVector32 had them.