http://matrixprogramming.com/MatrixMultiply/code/2direct/
with the goal of seeing how the compiler optimizes the loops. My commands
to compile and run the tests are in the makefile compare, and
make -f compare
compiles and runs the tests with GCC and Visual C++. Below are the
results with gcc 3.3 under Cygwin and Visual C++ 2005 Express Edition
on my HP notebook:
$ make -f compare
gcc -s -O3 -Wl,--stack=50000000 direct1.c -o direct1-gcc.exe
direct1-gcc.exe
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 2.453000 s
cl -O2 -nologo direct1.c -link -STACK:50000000 -out:direct1-vc.exe
direct1.c
direct1-vc.exe
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 1.984000 s
gcc -s -O3 -Wl,--stack=50000000 direct2.c -o direct2-gcc.exe
direct2-gcc.exe
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 2.047000 s
cl -O2 -nologo direct2.c -link -STACK:50000000 -out:direct2-vc.exe
direct2.c
direct2-vc.exe
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 1.985000 s
g++ -s -O3 direct.cc -o direct-gcc.exe
direct-gcc.exe 1000
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 2 s
cl -EHsc -O2 -nologo -DUSECLOCK direct.cc -link -out:direct-vc.exe
direct.cc
direct-vc.exe 1000
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 6.969 s
One sees that for the C code VC produces slightly faster code, but for
the C++ code it is more than 3 times slower. I am new to VC++
and I guess that there are some specific flags to optimize the C++
code. I am searching in Help now, but so far unsuccessfully. I would
appreciate any hint in this respect, as it is quite painful to lose a
factor of 3 in a simple loop.
Best wishes,
Evgenii
The C++ version is performing vector allocations. It is not the same
as your other versions which put the data on the stack.
P.S.
If you make them static arrays, you won't need such an awful stack.
Since the size is not dynamic, static arrays make sense here (unless
you want to compare other sizes in which case you should use malloc()
for C).
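A minimal sketch of the static-array suggestion, assuming a fixed size of 1000 and a plain i-j-k loop. The names N, A, B, C and multiply are illustrative and are not taken from direct1.c. Arrays with static storage duration are not placed on the stack, so the --stack / -STACK linker options should no longer be needed.

```cpp
#include <cstddef>

// Fixed dimension, as in the 1000x1000 benchmark (assumed, not from direct1.c).
const std::size_t N = 1000;

// Static storage duration: the arrays are zero-initialized and do not
// live on the stack, so the default stack size is sufficient.
static double A[N][N];
static double B[N][N];
static double C[N][N];

// Direct triple loop C = A * B over the leading n x n block (n <= N).
void multiply(std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}
```

Calling multiply(N) after filling A and B reproduces the full 1000x1000 case; a smaller n is handy for a quick check.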
My G++ performance is not like yours. Here are my timings on 2.2 GHz
AMD running Windows 2003 (32 bit OS):
Your makefile, but CXXFLAG = -s -O3 -DUSECLOCK:
C:\math\matmul>direct-cc.exe 1000 1000 1000
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 9.609 s
Microsoft Visual C++ with flags:
/Ox /Ob2 /Oi /Ot /Oy /GT /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE"
/D "USECLOCK" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MT /Zp16 /GS-
/arch:SSE /Fo"Release\\" /Fd"Release\vc80.pdb" /W4 /nologo /c /Wp64
/Zi /TP /errorReport:prompt
C:\math\matmul>direct-noprof.exe 1000 1000 1000
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 7.391 s
As above, with profile guided optimization:
C:\math\matmul>direct-profile.exe 1000 1000 1000
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 6.891 s
These use your makefile without changes, but I used gfortran and not
g77:
C:\math\matmul>direct1-c.exe
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 5.125000 s
C:\math\matmul>direct1-f.exe
time for C( 1000 , 1000 ) = A( 1000 ,
1000 ) B( 1000 , 1000 ) is 12.640625 s
C:\math\matmul>direct2-c.exe
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 5.172000 s
C:\math\matmul>direct2-f.exe
time for C( 1000 , 1000 ) = A( 1000 ,
1000 ) B( 1000 , 1000 ) is 5.1406250 s
Here I re-ran the fortran tests with g95:
g95 -s -O3 direct1.f -o direct1-f.exe
direct1-f.exe
time for C( 1000 , 1000 ) = A( 1000 , 1000 ) B( 1000 , 1000 ) is
13.140625 s
g95 -s -O3 direct2.f -o direct2-f.exe
direct2-f.exe
time for C( 1000 , 1000 ) = A( 1000 , 1000 ) B( 1000 , 1000 ) is
5.109375 s
It takes 2 seconds on that same machine to do a 1000x1000 C++ matrix
multiply using Strassen multiplication.
This is true, but the memory allocation happens only once. Allocating
three arrays should not take much time, so the difference should
be very small.
> P.S.
> If you make them static arrays, you won't need such an awful stack.
> Since the size is not dynamic, static arrays make sense here (unless
> you want to compare other sizes in which case you should use malloc()
> for C).
You are right. Static arrays would be simpler. But I guess this should
not affect the performance anyway.
> My G++ performance is not like yours. Here are my timings on 2.2 GHz
> AMD running Windows 2003 (32 bit OS):
>
> Your makefile, but CXXFLAG = -s -O3 -DUSECLOCK:
> C:\math\matmul>direct-cc.exe 1000 1000 1000
> time for C(1000,1000) = A(1000,1000) B(1000,1000) is 9.609 s
>
> Microsoft Visual C++ with flags:
> /Ox /Ob2 /Oi /Ot /Oy /GT /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE"
> /D "USECLOCK" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MT /Zp16 /GS-
> /arch:SSE /Fo"Release\\" /Fd"Release\vc80.pdb" /W4 /nologo /c /Wp64
> /Zi /TP /errorReport:prompt
> C:\math\matmul>direct-noprof.exe 1000 1000 1000
> time for C(1000,1000) = A(1000,1000) B(1000,1000) is 7.391 s
Thanks a lot. I will try these flags. I thought that -O2 included
everything, but that seems not to be the case.
Thank you for the suggestion. I guess that if you call DGEMM from
ATLAS or another optimized BLAS on your computer, you should get less than
one second.
My main goal here was just to see how the compiler optimizes the loops.
Unfortunately, your flags did not help on my system. I see the same
difference: about 2 s with g++ and more than 6 s with VC++. What versions
of gcc and VC++ do you use? I use gcc 3.3 and VC++ 2005 Express
Edition.
Could it be that Express Edition does not make complete optimization?
dcorbit@DCORBIT64 ~
$ gcc --version
gcc.exe (GCC) 3.2 (mingw special 20020817-1)
Copyright (C) 2002 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There
is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.
dcorbit@DCORBIT64 ~
$ g++ --version
g++.exe (GCC) 3.2 (mingw special 20020817-1)
Copyright (C) 2002 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There
is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.
dcorbit@DCORBIT64 ~
$ gfortran --version
GNU Fortran (GCC) 4.3.0
Copyright (C) 2008 Free Software Foundation, Inc.
GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of GNU Fortran
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING
dcorbit@DCORBIT64 ~
$ g95 --version
G95 (GCC 4.0.3 (g95 0.91!) Feb 27 2008)
Copyright (C) 2002-2005 Free Software Foundation, Inc.
G95 comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of G95
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING
dcorbit@DCORBIT64 ~
$
Microsoft Visual Studio 2005
Version 8.0.50727.762 (SP.050727-7600)
Microsoft .NET Framework
Version 2.0.50727 SP1
Installed Edition: Enterprise
Microsoft Visual Basic 2005 77642-113-3000004-41589
Microsoft Visual Basic 2005
Microsoft Visual C# 2005 77642-113-3000004-41589
Microsoft Visual C# 2005
Microsoft Visual C++ 2005 77642-113-3000004-41589
Microsoft Visual C++ 2005
Microsoft Visual J# 2005 77642-113-3000004-41589
Microsoft Visual J# 2005
Microsoft Visual Studio 2005 Tools for Applications
77642-113-3000004-41589
Microsoft Visual Studio 2005 Tools for Applications
Microsoft Visual Studio Tools for Office 77642-113-3000004-41589
Microsoft Visual Studio Tools for the Microsoft Office System
Microsoft Visual Web Developer 2005 77642-113-3000004-41589
Microsoft Visual Web Developer 2005
Microsoft Web Application Projects 2005 77642-113-3000004-41589
Microsoft Web Application Projects 2005
Version 8.0.50727.762
Visual Studio 2005 Team Edition for Developers
77642-113-3000004-41589
Microsoft Visual Studio 2005 Team Edition for Software Developers
Crystal Reports AAC60-G0CSA4B-V7000AY
Crystal Reports for Visual Studio 2005
DevPartner Studio 8.0.0.2999
Compuware DevPartner Studio
Copyright © 2005 Compuware Corporation. All rights reserved.
www.compuware.com
IBM Database Add-Ins 9.1.1.73
IBM Database Add-Ins for Visual Studio 2005. Copyright(c) IBM
Corporation. All rights reserved
Microsoft Visual Studio 2005 Professional Edition - ENU Service Pack 1
(KB926601)
This service pack is for Microsoft Visual Studio 2005 Professional
Edition - ENU.
If you later install a more recent service pack, this service pack
will be uninstalled automatically.
For more information, visit http://support.microsoft.com/kb/926601
Microsoft Visual Web Developer 2005 Express Edition - ENU Service Pack
1 (KB926751)
This service pack is for Microsoft Visual Web Developer 2005 Express
Edition - ENU.
If you later install a more recent service pack, this service pack
will be uninstalled automatically.
For more information, visit http://support.microsoft.com/kb/926751
Neumont ORM Architect 1.0.605.525 2006-05CTP
NORMA - Neumont Object-Role Modeling Architect
Security Update for Microsoft Visual Studio 2005 Professional Edition
- ENU (KB937061)
This Security Update is for Microsoft Visual Studio 2005 Professional
Edition - ENU.
If you later install a more recent service pack, this Security Update
will be uninstalled automatically.
For more information, visit http://support.microsoft.com/kb/937061
WorkflowServer Designer 4.5.2.0
WorkflowServer Designer
> Could it be that Express Edition does not make complete optimization?
Yes, for sure that is a problem. From http://www.thefreecountry.com/compilers/cpp.shtml
we have this:
"Microsoft .NET Framework Software Development Kit (SDK) / Free
Microsoft Visual C++ Compiler
The Microsoft Visual C/C++ command line compiler, along with C#,
VB.NET and JScript.NET, is available from Microsoft for download for
free. You will also need to download the Microsoft Windows Platform
SDK which contains the Windows headers and libraries for the
compilers. The command line compiler (at the time this was
written/reviewed) does not have an optimizer (or at least, not the
optimizer that ships with the Professional version)."
The Intel compiler makes very fast matrix code. I do not have the
latest version.
Thanks a lot. This information is very useful. One could have expected that
the Express Edition is not fully functional. On the other hand, you may
want to update g++ if you use it often. It seems that optimization was
improved in 3.3. It would be interesting to see what is
going on in gcc 4.
> The Intel compiler makes very fast matrix code. I do not have the
> latest version.
Mine is not the latest either: version 8. However, I guess that very fast
matrix code needs Intel MKL (Intel's optimized BLAS); without it I
do not expect the compiler to improve the situation significantly.
It turns out that this is not the case. I have asked at
microsoft.public.vc.language
http://groups.google.com/group/microsoft.public.vc.language/browse_frm/thread/874563e08c779048
and it turns out that VC++ uses safe (checked) iterators by default.
Defining _SECURE_SCL=0 solves the problem:
cl -EHsc -O2 -D_SECURE_SCL=0 -DUSECLOCK direct.cc
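For illustration, here is a sketch of the kind of std::vector-based loop that this setting affects. It is not the actual direct.cc; the function name multiply and the row-major layout are assumptions. In VC++ 2005, accesses to standard containers are range-checked even in release builds unless _SECURE_SCL is defined as 0, and in a loop like this the check sits in the innermost iteration. The macro has to be seen before any standard header is included, which is why it is passed on the command line.

```cpp
// Build with VC++ 2005 as in the post: cl -EHsc -O2 -D_SECURE_SCL=0 file.cc
// Other compilers simply ignore the macro.
#include <cstddef>
#include <vector>

// C = A * B for n x n matrices stored row-major in std::vector<double>.
std::vector<double> multiply(const std::vector<double>& A,
                             const std::vector<double>& B,
                             std::size_t n)
{
    std::vector<double> C(n * n, 0.0);   // one heap allocation for the result
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            // With checked iterators enabled, each A[...] and B[...]
            // access below carries a bounds check.
            for (std::size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    return C;
}
```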
Well.
"Evgenii Rudnyi" <use...@rudnyi.ru> wrote in message
news:cb76fea3-e7a3-42f7...@y21g2000hsf.googlegroups.com...
Bill,
Thanks for your comments. I do not use .NET, and I have actually found
the reason at microsoft.public.vc.language
http://groups.google.com/group/microsoft.public.vc.language/browse_frm/thread/874563e08c779048
It turns out that VC++ uses safe (checked) iterators by default.
Defining _SECURE_SCL=0 solves the problem:
cl -EHsc -O2 -D_SECURE_SCL=0 -DUSECLOCK direct.cc
So, the answer was that by default VC++ is secure but slow.
Evgenii
Evgenii,
Thank you for the tip about turning off _SECURE_SCL. However, after I
turned off _SECURE_SCL via my project properties, a vector routine
(v.back()) in OpenMesh that I use in my project reports an
access violation.
Am I missing something?
Thx!
coo
Probably, there is a bug in your code. You went outside the bounds of
your container.
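One way to end up outside the bounds with v.back() is to call it on an empty vector, which is undefined behavior in standard C++. A minimal sketch of a guard follows; the helper name last_or is hypothetical and is not part of OpenMesh or the standard library.

```cpp
#include <vector>

// Returns the last element of v, or fallback when v is empty.
// Calling v.back() on an empty vector is undefined behavior.
int last_or(const std::vector<int>& v, int fallback)
{
    return v.empty() ? fallback : v.back();
}
```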
If it went outside the bounds, that should have been reported with
_SECURE_SCL turned on, but it is not...
Curious...
With iterators there can be subtle bugs. For example, you may by chance use
iterators belonging to another container, or something like that.
Presumably _SECURE_SCL should catch that as well, but theory and
practice do not always coincide. It could also be that there are some
other defines that force more checks.