*Abstract*
I compared the peak performance of FreeBSD 8.0/amd64 and Ubuntu 9.10/amd64 using dgemm
(a linear algebra routine, matrix-matrix multiplication).
I obtained only 70% of theoretical peak performance on FreeBSD 8.0/amd64 and
almost 95% on Ubuntu 9.10/amd64. I'm really disappointed.
*Introduction*
I'm a friend of Gotoh Kazushige, the principal developer of GotoBLAS. He told me that
FreeBSD is not a suitable OS for scientific computing or high performance computing. He says
(in Japanese and my translation):
> I guess FreeBSD does page coloring, but I don't think FreeBSD takes into account the very
> large caches that recent CPUs have. Support for very large caches on Linux is still not very
> sophisticated, but on *BSDs it's worse; they use too fine-grained a memory allocation method,
> so we cannot expect large contiguous physical memory allocations.
> Moreover, process scheduling is not so nice, as *BSD employs an algorithm that
> moves a job between physical CPUs in turn instead of dedicating one core to such jobs.
> Run your own benchmark, and you'll see.
*Result*
Machine: Core i7 920 (theoretical peak 42.56-44.8 GFLOPS) / DDR3 1066
OS: FreeBSD 8.0/amd64 and Ubuntu 9.10
GotoBLAS2: 1.13
dgemm result
OS      : GFLOPS    : percent of peak
FreeBSD : 32.0      : 71%
Ubuntu  : 42.0-42.7 : 93.8%-95.3%
Thanks,
-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/
Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt
I'm not sure if this is the exact issue, but it might be a point
of reference worth investigating:
http://lists.freebsd.org/pipermail/freebsd-hackers/2010-March/031004.html
Thanks,
-Garrett
So, where's the profiling to discover why this is the case?
Also I'm not clear on what constitutes 'theoretical peak performance'
here or how it is being calculated. So figures like these come across as
unscientific.
I'm sure this is something which can be resolved if someone sits down,
profiles the app, and makes the necessary adjustments (e.g.
pthread_setaffinity_np()) to configure CPU affinity, if the lack of it
is pessimizing your friend's app.
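
A minimal sketch of the pinning suggested above, assuming a system that provides the non-portable pthread_getaffinity_np() and pthread_setaffinity_np() calls (FreeBSD, or Linux with glibc). The helper names first_allowed_cpu and pin_current_thread are illustrative only, and the FreeBSD branch of the #if is an untested assumption about the headers needed there.

```cpp
#include <pthread.h>
#if defined(__FreeBSD__)
#include <pthread_np.h>
#include <sys/param.h>
#include <sys/cpuset.h>
typedef cpuset_t cpu_mask_t;   // FreeBSD names the mask type cpuset_t
#else
#include <sched.h>             // glibc: cpu_set_t and the CPU_* macros
typedef cpu_set_t cpu_mask_t;
#endif

// Return the lowest-numbered CPU the calling thread is allowed to run on,
// or -1 if the affinity mask cannot be read.
int first_allowed_cpu()
{
    cpu_mask_t mask;
    CPU_ZERO(&mask);
    if (pthread_getaffinity_np(pthread_self(), sizeof(mask), &mask) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            return cpu;
    return -1;
}

// Restrict the calling thread to a single CPU.
// Returns 0 on success, otherwise an error number.
int pin_current_thread(int cpu)
{
    cpu_mask_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
}
```

Calling pin_current_thread(first_allowed_cpu()) at the start of a worker thread keeps that thread on one core (compile with -pthread). A BLAS library that creates its own worker threads would have to do this internally, so whether pinning closes the gap is something only a measurement can show.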
The PMC framework is rapidly maturing, and you can use KCacheGrind with
it to visualize context switch overhead.
But I think it's expecting a bit much to post informal results to
-stable, in an expectation of something other than informal suggestions
of what may help someone's maths-intensive application.
If there are performance issues, then reproducible results are needed,
as well as some basic profiling effort of the system elements involved,
before people could say anything either way, or offer further help.
cheers,
BMS
With what he said, tweaking memory allocation under FreeBSD and/or
linux would change the performance characteristics and either validate
or disprove his assumptions?
Adrian
This may well be the same sort of issue that was discussed in this thread here:
http://lists.freebsd.org/pipermail/freebsd-hackers/2010-March/031004.html
In short, the Core i7 CPUs have a feature called "TurboBoost" where
the clock speed of one or more cores is boosted when other cores are
idle and in a C2 or C3 sleep status ... if the appropriate power
saving mode isn't active on the system (which I don't think FreeBSD
does by default?), the idle cores are never put into the appropriate
power saving state, and as a result TurboBoost never kicks in...
It _may_ be that Ubuntu configures this correctly whereas FreeBSD does
not (out of the box)?
Of course it may be something else entirely, but worth checking out...
--Antony
Sorry about that, but the more important question (for us) is: are you willing to help
us improve, in addition to reporting your results?
> *Introduction*
> I'm a friend of Gotoh Kazushige, the principal developer of GotoBLAS. He told me that
> FreeBSD is not a suitable OS for scientific computing or high performance computing. He says
> (in Japanese and my translation):
>
>> I guess FreeBSD does page coloring, but I don't think FreeBSD takes into account the very
>> large caches that recent CPUs have.
AFAIK, recent FreeBSD doesn't use page coloring anymore.
>> Support for very large caches on Linux is still not very
>> sophisticated, but on *BSDs it's worse; they use too fine-grained a memory allocation method,
>> so we cannot expect large contiguous physical memory allocations.
Can your friend explain these points in more technical detail?
E.g. what kind of support, in his opinion, is needed for very large caches?
Why, in his opinion, does the memory need to be physically contiguous?
Perhaps he is talking about support for large pages (2M) and the related improvements in
TLB performance. If so, he (and you) may want to read about the 'superpages' feature of FreeBSD.
I am not sure if it is enabled by default in 8.0; you can check vm.pmap.pg_ps_enabled.
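
For a program that wants to log this setting alongside its benchmark numbers, the same value can be read with sysctlbyname(3). This is only a sketch: the function name superpages_enabled is made up for illustration, and the code assumes the sysctl is an int, which is how it is reported on FreeBSD/amd64.

```cpp
#if defined(__FreeBSD__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif
#include <cstddef>

// Returns 1 if vm.pmap.pg_ps_enabled is non-zero, 0 if it is zero, and -1
// if it cannot be read (which includes every system other than FreeBSD).
int superpages_enabled()
{
#if defined(__FreeBSD__)
    int value = 0;
    size_t len = sizeof(value);
    if (sysctlbyname("vm.pmap.pg_ps_enabled", &value, &len, NULL, 0) != 0)
        return -1;
    return value != 0 ? 1 : 0;
#else
    return -1;
#endif
}
```

Printing the result next to each measurement makes it easier to compare runs with and without superpages.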
>> Moreover, process scheduling is not so nice, as *BSD employs an algorithm that
>> moves a job between physical CPUs in turn instead of dedicating one core to such jobs.
>> Run your own benchmark, and you'll see.
Here I can only add an anecdotal 'me too'.
Sometimes I run single-threaded high-cpu programs like ffmpeg transcoding on
an otherwise idle system (just a bunch of system daemons in the background).
And I see that the cpu-consuming process frequently goes back and forth between my
two cores. CPU user loads on the cores are something like 60% vs 40%.
My expectations were that the process would mostly run on one core while the rest
of the threads would mostly run on the other.
I am not sure if that core switching really hurts performance and if there is
something wrong about it. But somehow it seems 'counter-intuitive'.
> *Result*
> Machine: Core i7 920 (42.56-44.8Gflops) / DDR3 1066
> OS: FreeBSD 8.0/amd64 and Ubuntu 9.10
> GotoBLAS2: 1.13
>
> dgemm result
> OS : FLOPS : percent in peak
> FreeBSD : 32.0 GFlops : 71%
> Ubuntu : 42.0-42.7GFlops : 93.8%-95.3%
It would also be good to learn more about your program.
How much memory does it typically use, how does it allocate it?
Is it single-threaded or not? If not, how many threads does it have and what do
they do, how do they communicate?
--
Andriy Gapon
> This may well be the same sort of issue that was discussed in this thread here:
>
> http://lists.freebsd.org/pipermail/freebsd-hackers/2010-March/031004.html
>
> In short, the Core i7 CPUs have a feature called "TurboBoost" where
> the clock speed of one or more cores is boosted when other cores are
> idle and in a C2 or C3 sleep status ... if the appropriate power
> saving mode isn't active on the system (which I don't think FreeBSD
> does by default?), the idle cores are never put into the appropriate
> power saving state, and as a result TurboBoost never kicks in...
>
> It _may_ be that Ubuntu configures this correctly whereas FreeBSD does
> not (out of the box)?
>
> Of course it may be something else entirely, but worth checking out...
Nakata-san's theoretical performance numbers assume 4 to 4.2 operations
per core per cycle at the nominal (2.66 GHz, non-TurboBoost) clock rate.
(DGEMM is double precision, but I am not familiar enough with scientific
computing or with the Nehalem implementation of SSE to know why it is
four operations per cycle rather than two -- is it because double
precision counts as two FLOPs or is it because of multiple issue?)
TurboBoost runs up to 2.93 GHz on this CPU, so it doesn't fit either the
theoretical peak performance or the performance discrepancy very well.
Michael Poole
Another question is what compilers (what versions of GCC) were used on both systems
to compile the program?
--
Andriy Gapon
On 8.0-RELEASE and later, they are. Line 183:
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/amd64/amd64/pmap.c?annotate=1.667.2.12
Commit where they got enabled by default (approx. 16 months ago):
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/amd64/amd64/pmap.c#rev1.646
--
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |
There's a port archivers/pbzip2, and I am inclined to believe this is a
good benchmark for multi-core performance in real-world usage (with an
appropriate input data set).
BZIP2 is a compression algorithm that parallelizes readily across
multiple cores, because its workload can be partitioned among them:
it block-sorts, and each thread can compress a long run of input data
independently of the others.
When I used PBZIP2 informally back in January, before advising on
FreeBSD/Xen, I saw largely the results I'd expect to see from such a
workload, and didn't encounter pessimization of benchmark figures.
Informal tests were performed on 8-STABLE at that time.
The OP may well be looking for Newton-Raphson approximations, to the
derivatives involved in his friend's linear algebra system. The point is
that PBZIP2 would also exercise context switches in a real-life workload.
I'd be concerned, as anyone else would be, about benchmarks which
apparently challenge FreeBSD's ability to tackle significant
mathematical workloads. But from what little I understand, from speaking
to David Schultz and others who have been involved with FreeBSD's
floating point performance, on a scientific basis -- without a
scientifically reproducible experiment, I don't see a problem.
Obviously, I am concerned that Nakata-san observes what he regards to be
a problem, and would like to help any way I can.
cheers,
BMS
From: Bruce Simpson <b...@incunabulum.net>
Subject: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Date: Mon, 12 Apr 2010 10:49:14 +0100
> So, where's the profiling to discover why this is the case?
Ok I'll provide better documentation so that everyone can test it very clearly.
(may take some time...)
> Also I'm not clear on what constitutes 'theoretical peak performance'
> here or how it is being calculated. So figures like these come across
> as unscientific.
The Core i7 920 (2.66GHz) has four cores, and each core can perform four floating-point
operations per cycle; thus 2.66GHz x 4 x 4 = 42.56 GFLOPS.
cf. http://www.intel.com/support/processors/sb/cs-023143.htm
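
The arithmetic can be written down as a small sketch. The function names are illustrative only, and the figure of four operations per core per cycle is the assumption stated in this thread, not something the code verifies.

```cpp
#include <cassert>

// Theoretical peak in GFLOPS:
// clock (GHz) x number of cores x floating-point operations per core per cycle.
double peak_gflops(double clock_ghz, int cores, int flops_per_cycle)
{
    assert(clock_ghz > 0.0 && cores > 0 && flops_per_cycle > 0);
    return clock_ghz * cores * flops_per_cycle;
}

// A measured rate expressed as a percentage of a given peak.
double percent_of_peak(double measured_gflops, double peak)
{
    assert(peak > 0.0);
    return 100.0 * measured_gflops / peak;
}
```

With these, peak_gflops(2.66, 4, 4) gives 42.56 and peak_gflops(2.80, 4, 4) gives 44.8, the two ends of the range quoted for this machine. The percentages in the result table appear to be taken against the 44.8 figure: percent_of_peak(32.0, 44.8) is about 71.4.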
> I'm sure this is something which can be resolved if someone sits down,
> profiles the app, and makes the necessary adjustments
> (e.g. pthread_setaffinity_np()) to configure CPU affinity, if the lack
> of it is pessimizing your friend's app.
might be. we run on the same machine.
> The PMC framework is rapidly maturing, and you can use KCacheGrind
> with it to visualize context switch overhead.
>
> But I think it's expecting a bit much to post informal results to
> -stable, in an expectation of something other than informal suggestions
> of what may help someone's maths-intensive application.
BLAS is a basic linear algebra package which is used in many applications.
It is also used for the top500 http://www.top500.org/
cf. http://www.top500.org/project/introduction
via LINPACK. dgemm is a Level 3 BLAS routine, which makes a very good benchmark for common PCs
as the calculation is CPU intensive.
> If there are performance issues, then reproducible results are needed,
> as well as some basic profiling effort of the system elements
> involved, before people could say anything either way, or offer
> further help.
again, I'll provide better documentation so that everyone can test it very clearly.
(may take some time...)
thanks,
I think this is not the case. I tested with TurboBoost on and off on Ubuntu, and GotoBLAS
achieved 95% of theoretical performance in both cases.
cf. http://www.intel.com/support/processors/sb/cs-023143.htm
and http://blog.goo.ne.jp/nakatamaho/e/86c0f4ac529fd5b530454ed795e6b466 (written in Japanese, tho)
Thanks
From: Antony Mawer <li...@mawer.org>
Subject: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
> Nakata-san's theoretical performance numbers assume 4 to 4.2 operations
> per core per cycle at the nominal (2.66 GHz, non-TurboBoost) clock rate.
> (DGEMM is double precision, but I am not familiar enough with scientific
> computing or with the Nehalem implementation of SSE to know why it is
> four operations per cycle rather than two -- is it because double
> precision counts as two FLOPs or is it because of multiple issue?)
> TurboBoost runs up to 2.93 GHz on this CPU, so it doesn't fit either the
> theoretical peak performance or the performance discrepancy very well.
Hi Michael,
I read http://www.intel.com/support/processors/sb/cs-023143.htm
and TurboBoost on 920 is 2.80GHz.
> why it is four operations per cycle rather than two
It's a bit strange to me as well, but I ran dgemm with m=k=n, and
in this case the flop count is 2n^3 + 2n^2 (even 2n^3 is okay).
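
As a sketch of how that count turns into a rate (the function names are illustrative only, and the formula is simply the one given above for the square case):

```cpp
#include <cassert>

// Operation count used in this thread for one dgemm call on square
// n x n matrices: 2n^3 + 2n^2. The 2n^3 term dominates for large n.
double dgemm_flops(int n)
{
    assert(n > 0);
    const double dn = static_cast<double>(n);
    return 2.0 * dn * dn * dn + 2.0 * dn * dn;
}

// Achieved rate in GFLOPS for `calls` dgemm calls that together took
// `seconds` of wall-clock time.
double achieved_gflops(int n, int calls, double seconds)
{
    assert(calls > 0 && seconds > 0.0);
    return dgemm_flops(n) * calls / seconds / 1e9;
}
```

For n = 3000 the 2n^2 term is 1/3000 of the 2n^3 term, which is why dropping it makes no practical difference to the reported figure.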
thanks
In my case,
% sysctl vm.pmap.pg_ps_enabled
vm.pmap.pg_ps_enabled: 1
thanks a lot!
From: Jeremy Chadwick <fre...@jdc.parodius.com>
Subject: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Hi
on Ubuntu
$ gcc -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.4.1-4ubuntu9' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 --program-suffix=-4.4 --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --disable-werror --with-arch-32=i486 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu9)
on FreeBSD
% gcc44 -v
Using built-in specs.
Target: x86_64-portbld-freebsd8.0
Configured with: ./../gcc-4.4-20100330/configure --disable-nls --libdir=/usr/local/lib/gcc44 --libexecdir=/usr/local/libexec/gcc44 --program-suffix=44 --with-as=/usr/local/bin/as --with-gmp=/usr/local --with-gxx-include-dir=/usr/local/lib/gcc44/include/c++/ --with-ld=/usr/local/bin/ld --with-libiconv-prefix=/usr/local --with-system-zlib --disable-libgcj --prefix=/usr/local --mandir=/usr/local/man --infodir=/usr/local/info/gcc44 --build=x86_64-portbld-freebsd8.0
Thread model: posix
gcc version 4.4.4 20100330 (prerelease) (GCC)
thanks
I like FreeBSD, esp. ports, since I have been a ports committer for 8 years,
so I'll do what I can do... The first step might be to produce reproducible results and provide
better analysis for ports/math/gotoblas.
> From: Michael Poole <mdp...@troilus.org>
> Subject: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
> Date: Mon, 12 Apr 2010 10:06:55 -0400
>
>> Nakata-san's theoretical performance numbers assume 4 to 4.2 operations
>> per core per cycle at the nominal (2.66 GHz, non-TurboBoost) clock rate.
>> (DGEMM is double precision, but I am not familiar enough with scientific
>> computing or with the Nehalem implementation of SSE to know why it is
>> four operations per cycle rather than two -- is it because double
>> precision counts as two FLOPs or is it because of multiple issue?)
>> TurboBoost runs up to 2.93 GHz on this CPU, so it doesn't fit either the
>> theoretical peak performance or the performance discrepancy very well.
>
> Hi Michael,
> I read http://www.intel.com/support/processors/sb/cs-023143.htm
> and TurboBoost on 920 is 2.80GHz.
Ah. I was looking at http://ark.intel.com/Product.aspx?id=37147 .
Given a 2.80 GHz TurboBoost, the 44.8 GFLOPS theoretical performance
number makes sense.
I think the more important point is that TurboBoost on this CPU gives at
most a 10% speedup, so it cannot explain the 25% performance difference.
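
A quick sketch of that comparison, using only the clock rates and dgemm figures quoted in this thread (the function names are illustrative only):

```cpp
#include <cassert>

// Percentage gain when going from `base` to `boosted`
// (for example, nominal clock to TurboBoost clock).
double percent_gain(double base, double boosted)
{
    assert(base > 0.0);
    return 100.0 * (boosted - base) / base;
}

// Percentage by which `measured` falls short of `reference`
// (for example, FreeBSD GFLOPS relative to Ubuntu GFLOPS).
double percent_shortfall(double reference, double measured)
{
    assert(reference > 0.0);
    return 100.0 * (reference - measured) / reference;
}
```

percent_gain(2.66, 2.93) is about 10.2 and percent_gain(2.66, 2.80) is about 5.3, while percent_shortfall(42.7, 32.0) is about 25.1, so the clock boost alone is too small to account for the gap.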
Michael
Many thanks for your interest.
I used the following program to measure the FLOPS. I'll provide more details later.
You may need <blas.h>, but you can change dgemm_f77 to something else to link
against GotoBLAS (ports/math/gotoblas). I think you can use math/atlas, but
it takes too long to compile...
---
#include <complex>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>
#define F77_FUNC(name,NAME) name ## _
#include <blas.h>
#define MAXLOOP 10
// User CPU time consumed by this process, in microseconds.
unsigned long long microseconds()
{
    struct rusage t;
    struct timeval tv;
    getrusage(RUSAGE_SELF, &t);
    tv = t.ru_utime;
    return ((unsigned long long)tv.tv_sec) * 1000000 + tv.tv_usec;
}

// Wall-clock time, in seconds.
double gettimeofday_sec()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + (double)tv.tv_usec * 1e-6;
}
int
main()
{
    int n;
    int incx = 1, incy = 1;
    double alpha = 3.14, beta = 2.717;
    double dgemmtime, t1, t2, t_1, t_2;
    for (n = 3000; n < 10000; n = n + 100) {
        printf("n: %d\n", (int)n);
        double *A = new double[n*n];
        double *B = new double[n*n];
        double *C = new double[n*n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                A[i*n+j] = i * j + 1;
                B[i*n+j] = (i+1) * (j+1) + 1;
                C[i*n+j] = (i+1) - (j+1) + 1;
            }
        }
        t1 = (double)microseconds(); t_1 = gettimeofday_sec();
        for (int p = 0; p < MAXLOOP; p++) {
            dgemm_f77("n", "n", &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);
        }
        t2 = (double)microseconds(); t_2 = gettimeofday_sec();
        // dgemmtime = (t2 - t1) * 1e-6;
        dgemmtime = (t_2 - t_1);
        printf("time : %lf or %lf \n", (t2 - t1) * 1e-6, t_2 - t_1);
        printf("Mflops : %lf\n", (2.0 * (double)n * (double)n * (double)n + 2.0 * (double)n * (double)n) * MAXLOOP / dgemmtime / (1000*1000));
        delete[] C;
        delete[] B;
        delete[] A;