Dear developers,
Thank you for developing the gemmlowp library. I benchmarked gemmlowp on my computer and, for comparison, benchmarked OpenBLAS's sgemm under the same conditions. gemmlowp appears to be much slower than OpenBLAS. The detailed results are below. Is this the performance that should be expected?
Thank you very much.
Ziye
To test OpenBLAS's gemm under the same conditions as gemmlowp, I modified the function 'time_for_gemms' in benchmark.cc as follows:
while (true) {
  double starttime = time();
  for (int i = 0; i < iters_at_a_time; i++) {
    for (size_t j = 0; j < gemms.size(); j++) {
      blasint m = gemms[j].rows;
      blasint n = gemms[j].depth;
      blasint k = gemms[j].cols;
      // new_a, new_b, new_c are buffers large enough.
      GEMM (&trans, &trans, &m, &n, &k, alpha, new_a, &m, new_b, &k, beta, new_c, &m);
    }
  }
  double endtime = time();
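For reference, here is a minimal sketch of how a benchmark shape could be mapped to BLAS dimensions and converted to a GFlops figure. It assumes that rows, depth, and cols describe a (rows x depth) by (depth x cols) product; that reading of the field names is an assumption, as are the names GemmShape, BlasDims, ToBlasDims, GemmFlops, and GflopsPerSecond, which are hypothetical and not part of gemmlowp or OpenBLAS. Under the BLAS convention C (m x n) = A (m x k) * B (k x n), that reading gives n = cols and k = depth, whereas the excerpt above reads n from depth and k from cols. The flop count 2*m*n*k is the same either way, but the matrix shapes differ, which may affect the measured speed.

```cpp
#include <cassert>

// Hypothetical stand-in for the benchmark's shape description:
// a (rows x depth) matrix multiplied by a (depth x cols) matrix.
struct GemmShape {
  int rows;
  int depth;
  int cols;
};

// BLAS-style dimensions for C (m x n) = A (m x k) * B (k x n).
struct BlasDims {
  int m;
  int n;
  int k;
};

// Maps the shape to BLAS dimensions: the shared inner dimension
// (depth) becomes k, and the result has rows x cols entries.
inline BlasDims ToBlasDims(const GemmShape& s) {
  return BlasDims{s.rows, s.cols, s.depth};
}

// One multiply and one add per (m, n, k) triple.
inline double GemmFlops(const BlasDims& d) {
  return 2.0 * d.m * d.n * d.k;
}

// Converts a measured time for one GEMM into GFlops per second.
inline double GflopsPerSecond(const BlasDims& d, double seconds) {
  assert(seconds > 0.0);
  return 1e-9 * GemmFlops(d) / seconds;
}
```

With this mapping, the 64x256x147 entry would be passed to sgemm as m = 64, n = 147, k = 256.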
Test result for openblas:
Benchmarking small model GEMMs...
running for 10 seconds...
Graph latency (over 47 iterations):
Best: 0.00642519s
Worst: 0.00739394s
Mean: 0.00682319s
25% trimmed mean: 0.00677253s
Mean of 10% best: 0.00646243s
Benchmarking typical GoogLeNet GEMMs...
running for 20 seconds...
Graph latency (over 70 iterations):
Best: 0.0172201s
Worst: 0.0460801s
Mean: 0.0195106s
25% trimmed mean: 0.0185641s
Mean of 10% best: 0.0174235s
Benchmarking default mode (typically multi-threaded)...
10x10x10 : 8.28 GFlops/s
20x20x20 : 22.53 GFlops/s
30x30x30 : 25.91 GFlops/s
40x40x40 : 41.56 GFlops/s
50x50x50 : 41.92 GFlops/s
60x60x60 : 49.6 GFlops/s
64x256x147 : 116.1 GFlops/s
100x100x1 : 5.072 GFlops/s
100x100x100 : 75.89 GFlops/s
100x1000x100 : 103.4 GFlops/s
1000x1000x1 : 9.682 GFlops/s
1000x1000x10 : 92.16 GFlops/s
1000x1000x100 : 276.1 GFlops/s
1000x1000x1000 : 292.7 GFlops/s
Benchmarking single-threaded mode...
10x10x10 : 8.266 GFlops/s
20x20x20 : 22.48 GFlops/s
30x30x30 : 25.93 GFlops/s
40x40x40 : 41.48 GFlops/s
50x50x50 : 42.11 GFlops/s
60x60x60 : 49.61 GFlops/s
64x256x147 : 69.63 GFlops/s
100x100x1 : 5.066 GFlops/s
100x100x100 : 66.96 GFlops/s
100x1000x100 : 75.58 GFlops/s
1000x1000x1 : 4.964 GFlops/s
1000x1000x10 : 35.86 GFlops/s
1000x1000x100 : 94.5 GFlops/s
1000x1000x1000 : 108.6 GFlops/s
Test result for gemmlowp:
Benchmarking small model GEMMs...
running for 10 seconds...
Graph latency (over 36 iterations):
Best: 0.00771596s
Worst: 0.0113729s
Mean: 0.00852033s
25% trimmed mean: 0.00830705s
Mean of 10% best: 0.00787207s
Benchmarking typical GoogLeNet GEMMs...
running for 20 seconds...
Graph latency (over 68 iterations):
Best: 0.0329185s
Worst: 0.0527048s
Mean: 0.0369283s
25% trimmed mean: 0.0363602s
Mean of 10% best: 0.0336597s
Benchmarking default mode (typically multi-threaded)...
10x10x10 : 3.613 GFlops/s
20x20x20 : 8.814 GFlops/s
30x30x30 : 14.39 GFlops/s
40x40x40 : 17.58 GFlops/s
50x50x50 : 17.92 GFlops/s
60x60x60 : 36.2 GFlops/s
64x256x147 : 76.44 GFlops/s
100x100x1 : 3.645 GFlops/s
100x100x100 : 43.48 GFlops/s
100x1000x100 : 85.67 GFlops/s
1000x1000x1 : 15.48 GFlops/s
1000x1000x10 : 87.93 GFlops/s
1000x1000x100 : 143.8 GFlops/s
1000x1000x1000 : 156.5 GFlops/s
Benchmarking single-threaded mode...
10x10x10 : 3.612 GFlops/s
20x20x20 : 8.811 GFlops/s
30x30x30 : 14.33 GFlops/s
40x40x40 : 17.59 GFlops/s
50x50x50 : 17.96 GFlops/s
60x60x60 : 26.4 GFlops/s
64x256x147 : 34.18 GFlops/s
100x100x1 : 3.606 GFlops/s
100x100x100 : 28.54 GFlops/s
100x1000x100 : 38.25 GFlops/s
1000x1000x1 : 3.077 GFlops/s
1000x1000x10 : 13.71 GFlops/s
1000x1000x100 : 19.24 GFlops/s
1000x1000x1000 : 19.68 GFlops/s
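To make the gap easier to read, the two single-threaded tables can be reduced to a ratio per size. The sketch below copies three rows from the results above; the names Row, Speedup, and SingleThreadedSample are hypothetical helpers introduced only for this illustration. For these rows, OpenBLAS comes out roughly 2.3x faster at 10x10x10 and 100x100x100, and roughly 5.5x faster at 1000x1000x1000.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One benchmark size with the single-threaded throughput reported
// above for each library (values copied from the two result tables).
struct Row {
  std::string size;
  double openblas_gflops;
  double gemmlowp_gflops;
};

// Ratio of two throughputs; a value above 1 means the first is faster.
inline double Speedup(double faster, double slower) {
  assert(slower > 0.0);
  return faster / slower;
}

// A small sample of the single-threaded results listed above.
inline std::vector<Row> SingleThreadedSample() {
  return {
      {"10x10x10", 8.266, 3.612},
      {"100x100x100", 66.96, 28.54},
      {"1000x1000x1000", 108.6, 19.68},
  };
}
```

Calling Speedup(row.openblas_gflops, row.gemmlowp_gflops) for each row of SingleThreadedSample() gives the ratios quoted above.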
Compilation Switches:
$ c++ benchmark_gemmlowp.cc -O3 -I../gemmlowp/test -msse4.1 -lpthread -o benchmark_gemmlowp.out
$ g++ -O2 -msse4.1 -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=8 -DASMNAME=sgemm -DASMFNAME=sgemm_ -DNAME=sgemm_ -DCNAME=sgemm -DCHAR_NAME=\"sgemm_\" -DCHAR_CNAME=\"sgemm\" -DNO_AFFINITY -I../gemmlowp/test -I../OpenBLAS -c -UCOMPLEX -UDOUBLE -o sgemm.o benchmark_openblas.cc
$ g++ -O2 -msse4.1 -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=8 -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I../gemmlowp/test -I.. -o benchmark_gemmgoto.out sgemm.o ../OpenBLAS/libopenblas_haswellp-r0.2.20.dev.a -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1 -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1/../../../../lib -L/lib/../lib -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1/../../.. -lc -lm -lpthread -lgfortran -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1 -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1/../../../../lib -L/lib/../lib -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1/../../.. -lgfortran -lm -lquadmath -lm -lc -lm
System:
CPU: Intel Core i7-6700K with frequency locked at 4GHz
Memory: 32 GB
OS: Archlinux 64bit