I needed to switch from g++ to VC++ and I have started with Express
Edition 2005. However, I see that when I compile my code with loops,
VC++ Express Edition 2005 produces code that is three times
slower than that of g++ 3.3. See
http://groups.google.com/group/sci.math.num-analysis/browse_frm/thread/0b303788cb8b2e52
People on sci.math.num-analysis say that Express Edition does not
include the complete optimization for C++. Is this correct? Which
edition would I need to get it?
Best wishes,
Evgenii
AFAIK, VC++ EE ships with the same compiler as the Professional
Edition, which definitely has an optimizer. You can see a detailed
feature comparison here:
"Visual C++ Editions"
http://msdn.microsoft.com/en-us/library/hs24szh9.aspx
Could you show a concise example of the problematic code?
Alex
> I needed to switch from g++ to VC++ and I have started with Express
> Edition 2005. However, I see that when I compile my code with loops,
> VC++ Express Edition 2005 produces code that is three times
> slower than that of g++ 3.3. See
[...]
> People on sci.math.num-analysis say that Express Edition does not
> include the complete optimization for C++. Is this correct? Which
> edition would I need to get it?
I think that the Express Edition has the same core optimizing compiler as
the Professional and higher editions.
(Maybe the only thing that is not present in the Express Edition is the
profile guided optimization.)
If you want, I can build your code with Visual C++ 2008 Professional, and
give you back the .exe, so you can do your tests.
Giovanni
> AFAIK, VC++ EE ships with the same compiler as the Professional
> Edition, which definitely has an optimizer. You can see a detailed
> feature comparison here:
AFAIK, VC200*5* EE did not have the optimizing compiler, but VC2008 EE has it.
--
Greetings
Jochen
My blog about Win32 and .NET
http://blog.kalmbachnet.de/
That would explain the slow code.
Alex
My code is
http://matrixprogramming.com/MatrixMultiply/code/2direct/
This is matrix multiplication via three naive nested loops. My results
and short description are at
http://groups.google.com/group/sci.math.num-analysis/browse_frm/thread/0b303788cb8b2e52
Well, the other people in this thread say that Visual C++ 2005 EE does
not have the full optimizer but Visual C++ 2008 EE has it. Thank you
for this information. I do appreciate it.
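For readers who cannot open the link, a naive three-loop multiplication might look roughly like the sketch below. It is not the code from direct.cc: the function name multiply_naive and the row-major flat-vector layout are assumptions made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from the linked direct.cc.
// Matrices are stored row-major in flat vectors: element (i, j) of a
// matrix with `cols` columns lives at index i * cols + j.
std::vector<double> multiply_naive(const std::vector<double>& a,
                                   const std::vector<double>& b,
                                   std::size_t n, std::size_t k, std::size_t m)
{
    assert(a.size() == n * k);  // a is n x k
    assert(b.size() == k * m);  // b is k x m

    std::vector<double> c(n * m, 0.0);  // result is n x m
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            double sum = 0.0;
            // The inner loop walks b with stride m, which is what makes
            // this version memory bound for large matrices.
            for (std::size_t l = 0; l < k; ++l) {
                sum += a[i * k + l] * b[l * m + j];
            }
            c[i * m + j] = sum;
        }
    }
    return c;
}
```

Every access in the inner loop goes through vector's operator[], so any per-access cost added by the library is paid n * m * k times.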
Unfortunately this piece of information is not correct. It used to be
that the free versions of the compiler were limited, but that is not
so for the 2005 and 2008 versions.
Bo Persson
Maybe he has iterator debugging turned on...
> My code is
>
> http://matrixprogramming.com/MatrixMultiply/code/2direct/
I built your C++ code using VS2008, in Release mode (optimized for speed).
You can download the .exe from here (it is stored in the .zip archive):
http://www.geocities.com/giovanni.dicanio/temp/MatrixTest-VC.zip
(I linked the C/C++ run-time statically, so you should not have
problems with manifests or deployment in general.)
> Well, the other people in this thread say that Visual C++ 2005 EE does
> not have the full optimizer but Visual C++ 2008 EE has it. Thank you
> for this information. I do appreciate it.
Instead, it seems that Bo agrees with me.
However, you can download VS2008 EE, and compare with the build I did using
VS2008 Pro.
HTH,
Giovanni
> I built your C++ code using VS2008, in Release mode (optimized for speed).
>
> You can download the .exe from here (it is stored in the .zip archive):
>
> http://www.geocities.com/giovanni.dicanio/temp/MatrixTest-VC.zip
I was curious, and I built the code using VC6 (+ SP6), too.
I updated the above archive with both the .exe's (the one built using VC2008
and the one built using VC6).
My benchmark is (on Intel Core 2 Duo, 2.4GHz, 2 GB RAM):
VC2008:
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 9.765 s
VC6:
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 5.578 s
So:
VC2008:
Size: 138 KB
Time: 9.765 s
VC6:
Size: 104 KB
Time: 5.578 s
...it seems that VC6's result is better, in both size and speed.
Giovanni
> My code is
>
> http://matrixprogramming.com/MatrixMultiply/code/2direct/
>
> This is matrix multiplication via three naive nested loops.
If you want faster matrix operations, you may consider the Blitz++
library:
http://www.oonumerics.org/blitz/
it uses advanced C++ template metaprogramming techniques to achieve high
speed.
HTH,
Giovanni
This is the URL for the original version of your source code (no VC6
comparison):
http://www.geocities.com/giovanni.dicanio/temp/MatrixTest-Original.zip
Giovanni
I got the following on a 2.16GHz Intel core2 with 2GB RAM using Visual
Studio Professional 2008 in a win32 project:
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 2.028000 s
> I got the following on a 2.16GHz Intel core2 with 2GB RAM using Visual
> Studio Professional 2008 in a win32 project:
>
> time for C(1000,1000) = A(1000,1000) B(1000,1000) is 2.028000 s
Very good.
Did you use any particular settings for optimization?
(I used default one for release builds: Maximize Speed /O2)
Giovanni
I used full optimisation but 'optimise for speed' did just as well.
This program uses an unusually large amount of space on the stack so I
reserved a large virtual stack space of 30,000,000 bytes (set in the linker). I
suspect that the poor timings may be caused by paging issues.
I hope not - I have used -O2.
Thank you for the suggestion. I know that this is not an efficient way to
compute matrix multiplication. Actually this is a part of my text
http://matrixprogramming.com/MatrixMultiply/
where I have tried to show that even in such a simple case it is good
to use libraries, that is the optimized BLAS. Note that in the direct
three loops implementation the bottleneck is the memory. See the
comparison for three different computers at the end of the page.
I have run your benchmarks on the computer in the middle of Table 1
(at the end of the page linked above).
$ ./MatrixVC9Original.exe 1000
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 25.516 s
$ ./MatrixTestVC6 1000
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 18.266 s
$ ./MatrixTestVC9 1000
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 23.094 s
At the same time, g++ 3.3 under Cygwin produces faster code:
$ make direct-cc.exe
g++ -s -O3 direct.cc -o direct-cc.exe
direct-cc.exe 1000
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 13.984 s
Funny. It seems that VC++ does something strange in this case. Once
more, the bottleneck is the memory, so this may not be the best way to
compare different compilers. Still, it is really funny.
Thanks a lot for your efforts.
Evgenii
Have you compiled the C or the C++ code? The C++ code does not use the stack,
but the C code does. I have done it for simplicity, as I write more often in C++.
Well, the stack size in this case is just what is needed to allocate 3 matrices
of 1000x1000, and this fits well within the memory of modern computers. Why
should there be a paging issue?
While not very useful, it's still possible to set optimizations in a debug
build.
So I don't think "optimize for speed" and DEBUG are mutually exclusive.
There's also "checked iterators" that is a bit different and doesn't require
debug.
http://msdn2.microsoft.com/en-US/library/aa985965(VS.80).aspx
http://msdn2.microsoft.com/en-US/library/aa985982(VS.80).aspx
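To give a feel for the kind of per-access check involved, here is a portable analogy in standard C++. It is not the MSVC checked-iterator mechanism itself: vector::at performs a range check on every access and throws std::out_of_range, while operator[] is only guarded here by an assert that disappears when NDEBUG is defined. The function names are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Checked variant: every element access is range checked by at(),
// which throws std::out_of_range when i >= v.size().
double sum_checked(const std::vector<double>& v, std::size_t count)
{
    double s = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        s += v.at(i);
    }
    return s;
}

// Unchecked variant: the caller is responsible for passing a valid
// count. The assert runs once, outside the loop, and is compiled out
// when NDEBUG is defined.
double sum_unchecked(const std::vector<double>& v, std::size_t count)
{
    assert(count <= v.size());
    double s = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        s += v[i];
    }
    return s;
}
```

In a tight numerical loop the difference between the two is one extra comparison and branch per element access, which can also make the loop harder for the compiler to optimize.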
I get the same results in both C and C++. Both C and C++ use the stack for
local variables.
When the image is loaded it will only have enough memory allocated for the
default stack set by VC++. If the program runs out of stack space there
will be a paging fault and the OS will allocate more pages for the stack and
then re-enter the program. If the amount of stack needed is very large it
can ask for more pages many times and this will make the program a lot
slower.
> $ ./MatrixVC9Original.exe 1000
> time for C(1000,1000) = A(1000,1000) B(1000,1000) is 25.516 s
>
> $ ./MatrixTestVC6 1000
> time for C(1000,1000) = A(1000,1000) B(1000,1000) is 18.266 s
>
> $ ./MatrixTestVC9 1000
> time for C(1000,1000) = A(1000,1000) B(1000,1000) is 23.094 s
>
> At the same time, g++ 3.3 under Cygwin produces faster code:
>
> $ make direct-cc.exe
> g++ -s -O3 direct.cc -o direct-cc.exe
> direct-cc.exe 1000
> time for C(1000,1000) = A(1000,1000) B(1000,1000) is 13.984 s
VC6 seems to be the fastest in the Visual C++ family.
And g++ seems even faster...
Frankly speaking, I can't understand that.
Maybe this is a particular case in which g++ and VC6 do a better job than
the "big brothers" (like VC9).
Or it could be that VC9 does more run-time security checks (than
VC6 and g++...), so the code runs slower...
> Funny. It seems that VC++ does something strange in this case. Once
> more, the bottleneck is the memory, so this may not be the best way to
> compare different compilers. Still, it is really funny.
Yes. I believe that this is absolutely *not* a "scientific" benchmark to
compare the quality of C++ compilers, of course. :)
...But, yes, as you write, it is kind of "funny".
> Thanks a lot for your efforts.
You're welcome.
Thank you for offering us this interesting opportunity of analysis.
Giovanni
Thanks a lot for the links. This was indeed the case. It turns out that
when cl is used from the command line, _SECURE_SCL is defined by default
and equal to 1. The following command solved the problem:
cl -EHsc -O2 -D_SECURE_SCL=0 -DUSECLOCK direct.cc
Once more, thanks a lot, Duane.
I am a bit surprised that such an option is on by default. Well.
Anyway, I am happy that the solution is found.
Evgenii
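As an alternative to passing -D_SECURE_SCL=0 on the command line, the macro can be defined in the source before any standard library header is included. The sketch below assumes that the same value is used in every translation unit of the program, which as far as I know the MSDN pages on checked iterators require; the dot function is only a made-up example of a loop over vector elements. Other compilers simply ignore the macro.

```cpp
// Must appear before the first standard library #include, and should
// have the same value in every translation unit (assumption: mixing
// values within one program is not supported by the library).
#ifndef _SECURE_SCL
#define _SECURE_SCL 0
#endif

#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example of a hot loop over vector elements, the kind of
// code whose speed depends on whether element access is checked.
double dot(const std::vector<double>& x, const std::vector<double>& y)
{
    assert(x.size() == y.size());
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        s += x[i] * y[i];
    }
    return s;
}
```

The command-line form has the advantage that it cannot be forgotten in a single file, so it is probably the safer of the two for a real project.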
This is not true. If you look at matrix.h
http://matrixprogramming.com/MatrixMultiply/code/2direct/
you see that I use vector<double> to keep the matrix data and it does
not use the stack but rather allocates memory at the heap.
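To illustrate the point without the link: a matrix class built on vector<double> might look like the sketch below. This is a hypothetical class, not the one in matrix.h, and the real header may differ, for example in storage order. Only the small Matrix object itself (two sizes plus the vector's bookkeeping) lives on the stack; the elements are allocated by the vector on the heap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration, not the class from matrix.h.
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    // Row-major element access (an assumption made for this sketch).
    double& operator()(std::size_t i, std::size_t j)
    {
        assert(i < rows_ && j < cols_);
        return data_[i * cols_ + j];
    }

    double operator()(std::size_t i, std::size_t j) const
    {
        assert(i < rows_ && j < cols_);
        return data_[i * cols_ + j];
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;  // elements live on the heap
};
```

A local Matrix(1000, 1000) therefore takes only a few dozen bytes of stack, while its 8,000,000 bytes of elements come from the heap.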
> When the image is loaded it will only have enough memory allocated for the
> default stack set by VC++. If the program runs out of stack space there
> will be a paging fault and the OS will allocate more pages for the stack and
> then re-enter the program. If the amount of stack needed is very large it
> can ask for more pages many times and this will make the program a lot
> slower.
The timing is done when the memory is already allocated and the
matrices are populated. So memory allocation either on the stack or on
the heap does not affect the timing.
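A minimal sketch of that timing scheme, assuming clock() is used as the -DUSECLOCK flag suggests: all three matrices are allocated and filled first, and only the three loops sit between the two clock() calls. The function name time_multiply and the fill values are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <vector>

// Hypothetical timing harness: returns the CPU seconds spent in the
// three loops only. Allocation and initialization happen before the
// first clock() call and are not measured.
double time_multiply(std::size_t n)
{
    assert(n > 0);

    // Allocate and populate everything up front, including the result.
    std::vector<double> a(n * n, 1.0);
    std::vector<double> b(n * n, 2.0);
    std::vector<double> c(n * n, 0.0);

    const std::clock_t start = std::clock();
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t l = 0; l < n; ++l) {
                sum += a[i * n + l] * b[l * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    const std::clock_t stop = std::clock();

    // Each element of c is a sum of n products 1.0 * 2.0. Checking it
    // also keeps the result in use after the timed region.
    assert(c[0] == 2.0 * static_cast<double>(n));

    return static_cast<double>(stop - start) / CLOCKS_PER_SEC;
}
```

With this layout, any paging caused by first touching the memory is paid during initialization, before the clock starts.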
As I have just written
cl -EHsc -O2 -D_SECURE_SCL=0 -DUSECLOCK direct.cc
solves the problem. Thanks to Duane Hebert. Iterators are safe by
default and as a result by default the performance suffers. Really a
strange choice.
I was referring to C and C++ in general, not to your specific code.
Both languages make use of both the stack and the heap.
>> When the image is loaded it will only have enough memory allocated for the
>> default stack set by VC++. If the program runs out of stack space there
>> will be a paging fault and the OS will allocate more pages for the stack and
>> then re-enter the program. If the amount of stack needed is very large it
>> can ask for more pages many times and this will make the program a lot
>> slower.
>
> The timing is done when the memory is already allocated and the
> matrices are populated. So memory allocation either on the stack or on
> the heap does not affect the timing.
A and B will be allocated outside the timing loop but C allocations
occur inside the loop.
> As I have just written
>
> cl -EHsc -O2 -D_SECURE_SCL=0 -DUSECLOCK direct.cc
>
> solves the problem. Thanks to Duane Hebert.
Yes, I read that.
I suspected there was security checking involved, in fact I wrote that in my
previous post:
>> Or it could be that VC9 does more run-time security checks (than
>> VC6 and g++...), so the code runs slower...
However, I was not aware of the flag that Duane correctly mentioned.
> Iterators are safe by
> default and as a result by default the performance suffers. Really a
> strange choice.
I think that the VC++ Team valued security over performance.
I think that Microsoft has been paying a lot of attention to code security in
recent years.
So, I don't think it is a strange choice, it's just a choice.
Giovanni
FWIW, this setting has been useful to us for tracking down hard-to-find
problems.
I've never noticed that much of a bottleneck in the
larger scheme of things but then again, I don't have a lot of code
like the OP posted.
The matrix C is also allocated outside of the loop in both C and C++
versions. Memory allocation within a loop for matrix multiplication
would be a disaster.
The reasoning is that if you can figure out what the setting does, you
can also figure out how to turn it off, if you want to.
Those who can't, are the ones that really need it enabled by default.
Bo Persson
Well, when I use an option to optimize for speed (-O2 in the case of
VC), I expect the compiler to produce code optimized for speed. And it
happens that with VC++ this is not the case. Look at the
documentation for -O2: there is nothing there about safe iterators. In
my view, this is quite confusing. This is what I actually wanted to
say.
Imagine how slow the safe iterators would be if you didn't optimize
for speed <g>
I would define that as a kind of "documentation bug".
I would expect a reference to _SECURE_SCL in the documentation for
optimize for speed (-O2).
I was not aware of that preprocessor macro. It's easy to find things when
you already know their name :)
There was a similar thing here about C++ 'new':
"The new and delete Operators"
http://msdn.microsoft.com/en-us/library/kftdy56f.aspx
In the official MSDN documentation on that page there was no reference to
the nothrow option of new.
Fortunately, Carl Daniel (who knew the nothrow option) added some community
content.
But IMHO that reference to nothrow should already have been in the official
documentation.
Giovanni
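For reference, the nothrow form of new mentioned above is standard C++ and is declared in the <new> header. A minimal sketch, with a made-up helper name try_alloc: instead of throwing std::bad_alloc, the expression yields a null pointer when the allocation fails.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical helper: returns a zero-filled buffer of `count` doubles,
// or a null pointer if the allocation fails. The caller releases the
// buffer with delete[].
double* try_alloc(std::size_t count)
{
    assert(count > 0);
    double* p = new (std::nothrow) double[count];
    if (p == nullptr) {
        return nullptr;  // allocation failed, no exception was thrown
    }
    for (std::size_t i = 0; i < count; ++i) {
        p[i] = 0.0;
    }
    return p;
}
```

Code that uses this form has to test the returned pointer before every use, which is why the throwing form is usually the default choice.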
I understand. Nobody says that this is good, just that it is the best
scheme found so far. :-)
If security features were disabled by default, those who need them the
most would probably not know why they should enable them. That's the
problem.
Bo Persson