optimization in VC++ Express Edition 2005

Evgenii Rudnyi

May 25, 2008, 7:25:53 AM

Hello,

I needed to switch from g++ to VC++, and I have started with Express
Edition 2005. However, I see that when I compile my code with loops,
the executable that VC++ Express Edition 2005 produces runs three times
slower than the one from g++ 3.3. See

http://groups.google.com/group/sci.math.num-analysis/browse_frm/thread/0b303788cb8b2e52

People on sci.math.num-analysis say that Express Edition does not
include the full C++ optimizer. Is this correct? Which edition would
I need to get it?

Best wishes,

Evgenii

Alex Blekhman

May 25, 2008, 11:23:27 AM

"Evgenii Rudnyi" wrote:
> I needed to switch from g++ to VC++, and I have started with
> Express Edition 2005. However, I see that when I compile my code
> with loops, the executable that VC++ Express Edition 2005 produces
> runs three times slower than the one from g++ 3.3.

> People on sci.math.num-analysis say that Express Edition does
> not include the full C++ optimizer. Is this correct? Which
> edition would I need to get it?

AFAIK, VC++ EE ships with the same compiler as the Professional
Edition, which definitely has an optimizer. You can see a detailed
feature comparison here:

"Visual C++ Editions"
http://msdn.microsoft.com/en-us/library/hs24szh9.aspx

Could you show a concise example of the problematic code?

Alex

Giovanni Dicanio

May 25, 2008, 11:30:04 AM

"Evgenii Rudnyi" <use...@rudnyi.ru> wrote in message
news:feb9016b-b62d-4ff0...@m36g2000hse.googlegroups.com...

> I needed to switch from g++ to VC++, and I have started with Express
> Edition 2005. However, I see that when I compile my code with loops,
> the executable that VC++ Express Edition 2005 produces runs three times
> slower than the one from g++ 3.3. See

[...]

> People on sci.math.num-analysis say that Express Edition does not
> include the full C++ optimizer. Is this correct? Which edition would
> I need to get it?

I think that the Express Edition has the same core optimizing compiler as
the Professional and higher editions.
(Maybe the only thing missing from the Express Edition is
profile-guided optimization.)

If you want, I can build your code with Visual C++ 2008 Professional, and
give you back the .exe, so you can do your tests.

Giovanni

Jochen Kalmbach [MVP]

May 25, 2008, 1:18:48 PM

Hi Alex!

> AFAIK, VC++ EE ships with the same compiler as the Professional
> Edition, which definitely has an optimizer. You can see a detailed
> feature comparison here:

AFAIK, VC200*5* did not have the optimizing compiler, but VC2008 does.

--
Greetings
Jochen

My blog about Win32 and .NET
http://blog.kalmbachnet.de/

Alex Blekhman

May 25, 2008, 2:31:43 PM

"Jochen Kalmbach [MVP]" wrote:
>> AFAIK, VC++ EE ships with the same compiler as the
>> Professional Edition, which definitely has an optimizer.
>

> AFAIK, VC200*5* did not have the optimizing compiler, but
> VC2008 does.

That would explain the slow code.

Alex

Evgenii Rudnyi

May 25, 2008, 2:57:02 PM

On 25 May, 17:30, "Giovanni Dicanio" <giovanni.dica...@invalid.com>
wrote:
> "Evgenii Rudnyi" <use...@rudnyi.ru> wrote in message news:feb9016b-b62d-4ff0...@m36g2000hse.googlegroups.com...

My code is

http://matrixprogramming.com/MatrixMultiply/code/2direct/

This is matrix multiplication via the naive three-loop algorithm. My results
and a short description are at

http://groups.google.com/group/sci.math.num-analysis/browse_frm/thread/0b303788cb8b2e52

Well, the other people in this thread say that Visual C++ 2005 EE does
not have the full optimizer but Visual C++ 2008 EE has it. Thank you
for this information. I do appreciate it.

Bo Persson

May 25, 2008, 6:07:58 PM

Unfortunately this piece of information is not correct. It used to be
that the free versions of the compiler were limited, but that is not
so for the 2005 and 2008 versions.

Bo Persson

Duane Hebert

May 25, 2008, 7:59:02 PM

>> Well, the other people in this thread say that Visual C++ 2005 EE
>> does not have the full optimizer but Visual C++ 2008 EE has it.
>> Thank you for this information. I do appreciate it.
>
> Unfortunately this piece of information is not correct. It used to be that
> the free versions of the compiler were limited, but that is not so for the
> 2005 and 2008 versions.

Maybe he has iterator debugging turned on...

Giovanni Dicanio

May 26, 2008, 4:13:00 AM

"Evgenii Rudnyi" <use...@rudnyi.ru> wrote in message

news:696f8e79-b870-465c...@z72g2000hsb.googlegroups.com...

> My code is
>
> http://matrixprogramming.com/MatrixMultiply/code/2direct/

I built your C++ code using VS2008, in Release mode (optimized for speed).

You can download the .exe from here (it is stored in the .zip archive):

http://www.geocities.com/giovanni.dicanio/temp/MatrixTest-VC.zip

(I linked the C/C++ run-time statically, so you should not have
problems with manifests or deployment in general.)

> Well, the other people in this thread say that Visual C++ 2005 EE does
> not have the full optimizer but Visual C++ 2008 EE has it. Thank you
> for this information. I do appreciate it.

However, it seems that Bo agrees with me.

In any case, you can download VS2008 EE and compare its output with the
build I made using VS2008 Pro.

HTH,
Giovanni

Giovanni Dicanio

May 26, 2008, 4:40:26 AM

"Giovanni Dicanio" <giovanni...@invalid.com> wrote in message
news:uLBNhiwv...@TK2MSFTNGP03.phx.gbl...

> I built your C++ code using VS2008, in Release mode (optimized for speed).
>
> You can download the .exe from here (it is stored in the .zip archive):
>
> http://www.geocities.com/giovanni.dicanio/temp/MatrixTest-VC.zip

I was curious, and I built the code using VC6 (+ SP6), too.

I updated the archive above to include both .exe files (the one built
using VC2008 and the one built using VC6).

My benchmark results (on an Intel Core 2 Duo, 2.4 GHz, 2 GB RAM):

VC2008:
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 9.765 s

VC6:
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 5.578 s

So:

VC2008:
Size: 138 KB
Time: 9.765 s

VC6:
Size: 104 KB
Time: 5.578 s

...it seems that VC6's result is better, in both size and speed.

Giovanni

Giovanni Dicanio

May 26, 2008, 4:42:55 AM

"Evgenii Rudnyi" <use...@rudnyi.ru> wrote in message

news:696f8e79-b870-465c...@z72g2000hsb.googlegroups.com...

> My code is
>
> http://matrixprogramming.com/MatrixMultiply/code/2direct/
>
> This is matrix multiplication via the naive three-loop algorithm.

If you want faster matrix operations, you may consider the Blitz++
library:

http://www.oonumerics.org/blitz/

it uses advanced C++ template metaprogramming techniques to achieve high
speed.

HTH,
Giovanni

Giovanni Dicanio

May 26, 2008, 5:34:31 AM

"Giovanni Dicanio" <giovanni...@invalid.com> wrote in message
news:uLBNhiwv...@TK2MSFTNGP03.phx.gbl...

>
> "Evgenii Rudnyi" <use...@rudnyi.ru> wrote in message
> news:696f8e79-b870-465c...@z72g2000hsb.googlegroups.com...
>
>> My code is
>>
>> http://matrixprogramming.com/MatrixMultiply/code/2direct/
>
> I built your C++ code using VS2008, in Release mode (optimized for speed).

This is the URL for the original version of your source code (no VC6
comparison):

http://www.geocities.com/giovanni.dicanio/temp/MatrixTest-Original.zip

Giovanni

Blind Anagram

May 26, 2008, 11:54:54 AM

"Giovanni Dicanio" <giovanni...@invalid.com> wrote in message
news:efqafxwv...@TK2MSFTNGP04.phx.gbl...

I got the following on a 2.16GHz Intel core2 with 2GB RAM using Visual
Studio Professional 2008 in a win32 project:

time for C(1000,1000) = A(1000,1000) B(1000,1000) is 2.028000 s

Giovanni Dicanio

May 26, 2008, 12:51:10 PM

"Blind Anagram" <m...@nothere.gov> wrote in message
news:Hu-dnRtbep7KQKfVnZ2dnUVZ8vSdnZ2d@plusnet...

> I got the following on a 2.16GHz Intel core2 with 2GB RAM using Visual
> Studio Professional 2008 in a win32 project:
>
> time for C(1000,1000) = A(1000,1000) B(1000,1000) is 2.028000 s

Very good.

Did you use any particular settings for optimization?
(I used the default for release builds: Maximize Speed, /O2.)

Giovanni

Blind Anagram

May 26, 2008, 1:06:33 PM

"Giovanni Dicanio" <giovanni...@invalid.com> wrote in message

news:O7yeyD1v...@TK2MSFTNGP04.phx.gbl...

I used full optimisation, but 'optimise for speed' did just as well.

This program uses an unusually large amount of space on the stack, so I
reserved a large virtual stack space of 30,000,000 bytes (set in the linker). I
suspect that the poor timings may be caused by paging issues.

Evgenii Rudnyi

May 26, 2008, 3:01:29 PM

I hope not - I have used -O2.

Evgenii Rudnyi

May 26, 2008, 3:11:50 PM

On May 26, 10:42 am, "Giovanni Dicanio" <giovanni.dica...@invalid.com>
wrote:
> "Evgenii Rudnyi" <use...@rudnyi.ru> wrote in message news:696f8e79-b870-465c...@z72g2000hsb.googlegroups.com...

>
> > My code is
>
> >http://matrixprogramming.com/MatrixMultiply/code/2direct/
>
> > This is matrix multiplication via the naive three-loop algorithm.
>
> If you want faster matrix operations, you may consider the Blitz++
> library:
>
> http://www.oonumerics.org/blitz/
>
> it uses advanced C++ template metaprogramming techniques to achieve high
> speed.
>
> HTH,
> Giovanni

Thank you for the suggestion. I know that this is not an efficient way to
compute a matrix product. Actually, this code is part of my text

http://matrixprogramming.com/MatrixMultiply/

where I have tried to show that even in such a simple case it pays
to use libraries, namely an optimized BLAS. Note that in the direct
three-loop implementation the bottleneck is memory access. See the
comparison of three different computers at the end of the page.

I have run your executables on the computer in the middle of Table 1
(at the end of the page linked above):
$ ./MatrixVC9Original.exe 1000
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 25.516 s

$ ./MatrixTestVC6 1000
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 18.266 s

$ ./MatrixTestVC9 1000
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 23.094 s

For comparison, the code produced by g++ 3.3 under Cygwin gives:

$ make direct-cc.exe
g++ -s -O3 direct.cc -o direct-cc.exe
direct-cc.exe 1000
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 13.984 s

Funny. It seems that VC++ does something strange in this case. Once
more, the bottleneck is memory, so this may not be the best way to
compare different compilers. Still, it is really funny.

Thanks a lot for your efforts.

Evgenii

Evgenii Rudnyi

May 26, 2008, 3:16:56 PM

On May 26, 7:06 pm, "Blind Anagram" <m...@nothere.gov> wrote:
> "Giovanni Dicanio" <giovanni.dica...@invalid.com> wrote in message

Did you compile the C or the C++ code? C++ code does not use the stack, but
C does. I did it that way for simplicity, as I write more often in C++.
Besides, the stack in this case only has to hold 3 matrices of
1000x1000, which fit easily in the memory of modern computers. Why should
there be a paging issue?

Duane Hebert

May 26, 2008, 3:28:15 PM

"Evgenii Rudnyi" <use...@rudnyi.ru> wrote in message
news:3aeb7570-904a-4478...@x41g2000hsb.googlegroups.com...

While not very useful, it's still possible to enable optimizations in a debug
build, so I don't think "optimize for speed" and DEBUG are mutually exclusive.
There are also "checked iterators", which are a bit different and don't
require a debug build.

http://msdn2.microsoft.com/en-US/library/aa985965(VS.80).aspx
http://msdn2.microsoft.com/en-US/library/aa985982(VS.80).aspx

Blind Anagram

May 26, 2008, 3:37:32 PM

"Evgenii Rudnyi" <use...@rudnyi.ru> wrote in message

news:5a578bf1-ba85-4a64...@79g2000hsk.googlegroups.com...

I get the same results in both C and C++. Both C and C++ use the stack for
local variables.

When the image is loaded it will only have enough memory allocated for the
default stack set by VC++. If the program runs out of stack space there
will be a page fault, and the OS will allocate more pages for the stack and
then re-enter the program. If the amount of stack needed is very large it
can ask for more pages many times, and this will make the program a lot
slower.

Giovanni Dicanio

May 26, 2008, 5:49:33 PM

"Evgenii Rudnyi" <use...@rudnyi.ru> wrote in message

news:0423fd44-117f-4bea...@a1g2000hsb.googlegroups.com...

> $ ./MatrixVC9Original.exe 1000
> time for C(1000,1000) = A(1000,1000) B(1000,1000) is 25.516 s
>
> $ ./MatrixTestVC6 1000
> time for C(1000,1000) = A(1000,1000) B(1000,1000) is 18.266 s
>
> $ ./MatrixTestVC9 1000
> time for C(1000,1000) = A(1000,1000) B(1000,1000) is 23.094 s
>
> For comparison, the code produced by g++ 3.3 under Cygwin gives:
>
> $ make direct-cc.exe
> g++ -s -O3 direct.cc -o direct-cc.exe
> direct-cc.exe 1000
> time for C(1000,1000) = A(1000,1000) B(1000,1000) is 13.984 s

VC6 seems to be the fastest in the Visual C++ family.
And g++ seems even faster...

Frankly speaking, I can't understand that.

Maybe this is a particular case in which g++ and VC6 do a better job than
the "big brothers" (like VC9).
Or it could be that VC9 does more run-time security checks (than
VC6 and g++...), so the code runs slower...

> Funny. It seems that VC++ does something strange in this case. Once
> more, the bottleneck is memory, so this may not be the best way to
> compare different compilers. Still, it is really funny.

Yes. I believe that this is absolutely *not* a "scientific" benchmark to
compare the quality of C++ compilers, of course. :)
...But, yes, as you write, it is kind of "funny".

> Thanks a lot for your efforts.

You're welcome.

Thank you for offering us this interesting opportunity for analysis.

Giovanni

Evgenii Rudnyi

May 27, 2008, 2:16:43 PM

Thanks a lot for the links. That was indeed the cause. It turns out
that when you use cl from the command line, _SECURE_SCL is defined
as 1 by default. The following command solved the problem:

cl -EHsc -O2 -D_SECURE_SCL=0 -DUSECLOCK direct.cc

Once more, thanks a lot, Duane.

I am a bit surprised that such an option is on by default. Well.
Anyway, I am happy that the solution has been found.

Evgenii

Evgenii Rudnyi

May 27, 2008, 2:23:45 PM

On 26 May, 21:37, "Blind Anagram" <m...@nothere.gov> wrote:
...

> > Did you compile the C or the C++ code? C++ code does not use the stack, but
> > C does. I did it that way for simplicity, as I write more often in C++.
> > Besides, the stack in this case only has to hold 3 matrices of
> > 1000x1000, which fit easily in the memory of modern computers. Why should
> > there be a paging issue?
>
> I get the same results in both C and C++. Both C and C++ use the stack for
> local variables.

This is not true. If you look at matrix.h

http://matrixprogramming.com/MatrixMultiply/code/2direct/

you will see that I use vector<double> to store the matrix data, and it
does not use the stack but allocates memory on the heap.

> When the image is loaded it will only have enough memory allocated for the
> default stack set by VC++. If the program runs out of stack space there
> will be a page fault, and the OS will allocate more pages for the stack and
> then re-enter the program. If the amount of stack needed is very large it
> can ask for more pages many times, and this will make the program a lot
> slower.

The timing is done when the memory is already allocated and the
matrices are populated. So memory allocation either on the stack or on
the heap does not affect the timing.

Evgenii Rudnyi

May 27, 2008, 2:26:46 PM

On 26 May, 23:49, "Giovanni Dicanio" <giovanni.dica...@invalid.com>
wrote:
...

> VC6 seems to be the fastest in the Visual C++ family.
> And g++ seems even faster...
>
> Frankly speaking, I can't understand that.

...

As I have just written

cl -EHsc -O2 -D_SECURE_SCL=0 -DUSECLOCK direct.cc

solves the problem. Thanks to Duane Hebert. Iterators are safe by
default and as a result by default the performance suffers. Really a
strange choice.

Blind Anagram

May 27, 2008, 3:00:54 PM

Evgenii Rudnyi wrote:
> On 26 May, 21:37, "Blind Anagram" <m...@nothere.gov> wrote:
> ...
>>> Did you compile the C or the C++ code? C++ code does not use the stack, but
>>> C does. I did it that way for simplicity, as I write more often in C++.
>>> Besides, the stack in this case only has to hold 3 matrices of
>>> 1000x1000, which fit easily in the memory of modern computers. Why should
>>> there be a paging issue?
>> I get the same results in both C and C++. Both C and C++ use the stack for
>> local variables.
>
> This is not true. If you look at matrix.h
>
> http://matrixprogramming.com/MatrixMultiply/code/2direct/
>
> you will see that I use vector<double> to store the matrix data, and it
> does not use the stack but allocates memory on the heap.

I was referring to C and C++ in general, not to your specific code.

Both languages make use of both the stack and the heap.

>> When the image is loaded it will only have enough memory allocated for the
>> default stack set by VC++. If the program runs out of stack space there
>> will be a page fault, and the OS will allocate more pages for the stack and
>> then re-enter the program. If the amount of stack needed is very large it
>> can ask for more pages many times, and this will make the program a lot
>> slower.
>
> The timing is done when the memory is already allocated and the
> matrices are populated. So memory allocation either on the stack or on
> the heap does not affect the timing.

A and B will be allocated outside the timing loop but C allocations
occur inside the loop.

Giovanni Dicanio

May 27, 2008, 3:28:01 PM

"Evgenii Rudnyi" <use...@rudnyi.ru> wrote in message

news:c5674bf2-19d3-4c71...@2g2000hsn.googlegroups.com...

> As I have just written
>
> cl -EHsc -O2 -D_SECURE_SCL=0 -DUSECLOCK direct.cc
>
> solves the problem. Thanks to Duane Hebert.

Yes, I read that.

I suspected that security checks were involved; in fact, I wrote that in my
previous post:

>> Or it could be that VC9 does more run-time security checks (than
>> VC6 and g++...), so the code runs slower...

However, I was not aware of the flag that Duane correctly mentioned.

> Iterators are safe by
> default and as a result by default the performance suffers. Really a
> strange choice.

I think the VC++ team valued security over performance.
I think Microsoft has been paying a lot of attention to code security in recent
years.
So I don't think it is a strange choice; it's just a choice.

Giovanni

Duane Hebert

May 27, 2008, 3:42:13 PM

>> Iterators are safe by
>> default and as a result by default the performance suffers. Really a
>> strange choice.
>
> I think the VC++ team valued security over performance.
> I think Microsoft has been paying a lot of attention to code security in
> recent years.
> So I don't think it is a strange choice; it's just a choice.

FWIW, this setting has helped us track down hard-to-find
problems.

I've never noticed much of a bottleneck from it in the
larger scheme of things, but then again, I don't have a lot of code
like the OP posted.

Evgenii Rudnyi

May 27, 2008, 3:54:05 PM

The matrix C is also allocated outside of the loop in both C and C++
versions. Memory allocation within a loop for matrix multiplication
would be a disaster.

Bo Persson

May 28, 2008, 1:04:49 PM

Evgenii Rudnyi wrote:
>
> Thanks a lot for the links. That was indeed the cause. It turns out
> that when you use cl from the command line, _SECURE_SCL is defined
> as 1 by default. The following command solved the problem:
>
> cl -EHsc -O2 -D_SECURE_SCL=0 -DUSECLOCK direct.cc
>
> Once more, thanks a lot, Duane.
>
> I am a bit surprised that such an option is on by default. Well.

The reasoning is that if you can figure out what the setting does, you
can also figure out how to turn it off, if you want to.

Those who can't are the ones who really need it enabled by default.

Bo Persson

Evgenii Rudnyi

May 28, 2008, 4:58:25 PM

Well, when I use an option to optimize for speed (-O2 in the case of
VC), I expect the compiler to produce code optimized for speed, and it
turns out that with VC++ this is not the case. Look at the
documentation for -O2: there is nothing there about safe iterators. In
my view, this is quite confusing. That is what I actually wanted to
say.

Duane Hebert

May 28, 2008, 5:02:00 PM

>Well, when I use an option to optimize for speed (-O2 in the case of
>VC), I expect the compiler to produce code optimized for speed, and it
>turns out that with VC++ this is not the case. Look at the
>documentation for -O2: there is nothing there about safe iterators. In
>my view, this is quite confusing. That is what I actually wanted to
>say.

Imagine how slow the safe iterators would be if you didn't optimize
for speed <g>

Giovanni Dicanio

May 28, 2008, 5:05:58 PM

"Evgenii Rudnyi" <use...@rudnyi.ru> wrote in message

news:64c65161-f6d1-4570...@2g2000hsn.googlegroups.com...

I would call that a kind of "documentation bug".

I would expect a reference to _SECURE_SCL in the documentation for -O2
(optimize for speed).

I was not aware of that preprocessor macro. It's easy to find things when
you already know their name :)

There was a similar issue with C++ 'new':

"The new and delete Operators"
http://msdn.microsoft.com/en-us/library/kftdy56f.aspx

The official MSDN documentation on that page made no reference to the
nothrow option of new.
Fortunately, Carl Daniel (who knew about the nothrow option) added it as
community content.
But IMHO that reference to nothrow should already have been in the official
documentation.

Giovanni

Bo Persson

May 28, 2008, 5:14:21 PM

I understand. Nobody says that this is good, just that it is the best
scheme found so far. :-)

If security features were disabled by default, those who need them the
most would probably not know why they should enable them. That's the
problem.

Bo Persson