About performance

Ziye Fan

Dec 30, 2016, 7:29:07 AM

to gemmlowp

Dear developers,

Thank you for developing the gemmlowp library. I tested gemmlowp on my computer and, for comparison, ran OpenBLAS's sgemm under the same conditions. gemmlowp appears to be much slower than OpenBLAS; the detailed results are below. Is this the performance I should expect?

Thank you very much.
Ziye

To test OpenBLAS's GEMM under the same conditions as gemmlowp, I modified the function 'time_for_gemms' in benchmark.cc:

while (true) {
    double starttime = time();
    for (int i = 0; i < iters_at_a_time; i++) {
      for (size_t j = 0; j < gemms.size(); j++) {
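        // Note: BLAS treats k as the inner (accumulation) dimension, so this
        // maps gemmlowp's depth to n and cols to k. The operation count
        // 2*m*n*k is the same either way.
        blasint m = gemms[j].rows;
        blasint n = gemms[j].depth;
        blasint k = gemms[j].cols;

        // new_a, new_b, new_c are buffers large enough for every shape.
        GEMM (&trans, &trans, &m, &n, &k, alpha, new_a, &m, new_b, &k, beta, new_c, &m);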
      }
    }
    double endtime = time();
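
For anyone who wants to sanity-check the GFlops/s figures below, here is a minimal sketch of how such a number can be derived from a timing loop. It assumes the benchmark counts 2*rows*depth*cols operations per GEMM, which is the usual convention but is not confirmed here against benchmark.cc. The names `gemm_ops`, `naive_sgemm`, and `measure_gflops` are hypothetical, and the naive loop only stands in for the GEMM call; it is neither OpenBLAS nor gemmlowp.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Operations in one GEMM: each of the rows*cols outputs needs `depth`
// multiplies and `depth` adds (assumed counting convention).
double gemm_ops(std::size_t rows, std::size_t depth, std::size_t cols) {
  return 2.0 * static_cast<double>(rows) * static_cast<double>(depth) *
         static_cast<double>(cols);
}

// Reference column-major GEMM:
// C (rows x cols) = A (rows x depth) * B (depth x cols).
void naive_sgemm(std::size_t rows, std::size_t depth, std::size_t cols,
                 const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>& c) {
  assert(a.size() == rows * depth);
  assert(b.size() == depth * cols);
  assert(c.size() == rows * cols);
  for (std::size_t j = 0; j < cols; ++j) {
    for (std::size_t i = 0; i < rows; ++i) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < depth; ++p) {
        acc += a[i + p * rows] * b[p + j * depth];
      }
      c[i + j * rows] = acc;
    }
  }
}

// Runs the GEMM `iters` times and converts the elapsed time to Gop/s.
double measure_gflops(std::size_t rows, std::size_t depth, std::size_t cols,
                      int iters) {
  std::vector<float> a(rows * depth, 1.0f);
  std::vector<float> b(depth * cols, 1.0f);
  std::vector<float> c(rows * cols, 0.0f);
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    naive_sgemm(rows, depth, cols, a, b, c);
  }
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  volatile float sink = c[0];  // read one result so the work is observable
  (void)sink;
  return 1e-9 * gemm_ops(rows, depth, cols) * iters / elapsed.count();
}
```

Calling `measure_gflops(100, 100, 100, 1000)` gives a figure comparable in kind to the `100x100x100` rows, though a naive loop will land far below either library. Because the count is symmetric in its three arguments, swapping depth and cols changes the matrix shapes but not the operation total.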

Test results for OpenBLAS:

Benchmarking small model GEMMs...
running for 10 seconds...
Graph latency (over 47 iterations):
Best:             0.00642519s
Worst:            0.00739394s
Mean:             0.00682319s
25% trimmed mean: 0.00677253s
Mean of 10% best: 0.00646243s
Benchmarking typical GoogLeNet GEMMs...
running for 20 seconds...
Graph latency (over 70 iterations):
Best:             0.0172201s
Worst:            0.0460801s
Mean:             0.0195106s
25% trimmed mean: 0.0185641s
Mean of 10% best: 0.0174235s
Benchmarking default mode (typically multi-threaded)...
10x10x10 : 8.28 GFlops/s
20x20x20 : 22.53 GFlops/s
30x30x30 : 25.91 GFlops/s
40x40x40 : 41.56 GFlops/s
50x50x50 : 41.92 GFlops/s
60x60x60 : 49.6 GFlops/s
64x256x147 : 116.1 GFlops/s
100x100x1 : 5.072 GFlops/s
100x100x100 : 75.89 GFlops/s
100x1000x100 : 103.4 GFlops/s
1000x1000x1 : 9.682 GFlops/s
1000x1000x10 : 92.16 GFlops/s
1000x1000x100 : 276.1 GFlops/s
1000x1000x1000 : 292.7 GFlops/s

Benchmarking single-threaded mode...
10x10x10 : 8.266 GFlops/s
20x20x20 : 22.48 GFlops/s
30x30x30 : 25.93 GFlops/s
40x40x40 : 41.48 GFlops/s
50x50x50 : 42.11 GFlops/s
60x60x60 : 49.61 GFlops/s
64x256x147 : 69.63 GFlops/s
100x100x1 : 5.066 GFlops/s
100x100x100 : 66.96 GFlops/s
100x1000x100 : 75.58 GFlops/s
1000x1000x1 : 4.964 GFlops/s
1000x1000x10 : 35.86 GFlops/s
1000x1000x100 : 94.5 GFlops/s
1000x1000x1000 : 108.6 GFlops/s

Test results for gemmlowp:

Benchmarking small model GEMMs...
running for 10 seconds...
Graph latency (over 36 iterations):
Best:             0.00771596s
Worst:            0.0113729s
Mean:             0.00852033s
25% trimmed mean: 0.00830705s
Mean of 10% best: 0.00787207s
Benchmarking typical GoogLeNet GEMMs...
running for 20 seconds...
Graph latency (over 68 iterations):
Best:             0.0329185s
Worst:            0.0527048s
Mean:             0.0369283s
25% trimmed mean: 0.0363602s
Mean of 10% best: 0.0336597s
Benchmarking default mode (typically multi-threaded)...
10x10x10 : 3.613 GFlops/s
20x20x20 : 8.814 GFlops/s
30x30x30 : 14.39 GFlops/s
40x40x40 : 17.58 GFlops/s
50x50x50 : 17.92 GFlops/s
60x60x60 : 36.2 GFlops/s
64x256x147 : 76.44 GFlops/s
100x100x1 : 3.645 GFlops/s
100x100x100 : 43.48 GFlops/s
100x1000x100 : 85.67 GFlops/s
1000x1000x1 : 15.48 GFlops/s
1000x1000x10 : 87.93 GFlops/s
1000x1000x100 : 143.8 GFlops/s
1000x1000x1000 : 156.5 GFlops/s

Benchmarking single-threaded mode...
10x10x10 : 3.612 GFlops/s
20x20x20 : 8.811 GFlops/s
30x30x30 : 14.33 GFlops/s
40x40x40 : 17.59 GFlops/s
50x50x50 : 17.96 GFlops/s
60x60x60 : 26.4 GFlops/s
64x256x147 : 34.18 GFlops/s
100x100x1 : 3.606 GFlops/s
100x100x100 : 28.54 GFlops/s
100x1000x100 : 38.25 GFlops/s
1000x1000x1 : 3.077 GFlops/s
1000x1000x10 : 13.71 GFlops/s
1000x1000x100 : 19.24 GFlops/s
1000x1000x1000 : 19.68 GFlops/s
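
To put the two result sets side by side, the following sketch computes the OpenBLAS-to-gemmlowp throughput ratio for a few of the shapes listed above. The numbers are copied from the two logs; the `Row` struct and the helper names are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// One benchmark shape with the throughput reported by each library (GFlops/s).
struct Row {
  std::string shape;
  double openblas;
  double gemmlowp;
};

// How many times faster OpenBLAS was than gemmlowp for this shape.
double speedup(const Row& r) {
  assert(r.gemmlowp > 0.0);
  return r.openblas / r.gemmlowp;
}

// Selected rows from the "single-threaded mode" sections of both logs.
std::vector<Row> single_threaded_rows() {
  return {{"10x10x10", 8.266, 3.612},
          {"100x100x100", 66.96, 28.54},
          {"1000x1000x1000", 108.6, 19.68}};
}

// The same shapes from the "default mode" sections of both logs.
std::vector<Row> default_mode_rows() {
  return {{"10x10x10", 8.28, 3.613},
          {"100x100x100", 75.89, 43.48},
          {"1000x1000x1000", 292.7, 156.5}};
}

// Prints one line per shape with the ratio rounded to two decimals.
void print_speedups(const char* title, const std::vector<Row>& rows) {
  std::printf("%s\n", title);
  for (const Row& r : rows) {
    std::printf("  %-16s %.2fx\n", r.shape.c_str(), speedup(r));
  }
}
```

Calling `print_speedups("single-threaded", single_threaded_rows())` shows the gap widening with matrix size in single-threaded mode, while in default mode the largest shape differs by a bit under a factor of two.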

Compilation Switches:

gemmlowp:

$ c++ benchmark_gemmlowp.cc -O3 -I../gemmlowp/test -msse4.1 -lpthread -o benchmark_gemmlowp.out

openblas:

$ g++ -O2 -msse4.1 -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=8 -DASMNAME=sgemm -DASMFNAME=sgemm_ -DNAME=sgemm_ -DCNAME=sgemm -DCHAR_NAME=\"sgemm_\" -DCHAR_CNAME=\"sgemm\" -DNO_AFFINITY -I../gemmlowp/test -I../OpenBLAS -c -UCOMPLEX -UDOUBLE -o sgemm.o benchmark_openblas.cc

$ g++ -O2 -msse4.1 -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=8 -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I../gemmlowp/test -I.. -o benchmark_gemmgoto.out sgemm.o ../OpenBLAS/libopenblas_haswellp-r0.2.20.dev.a -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1 -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1/../../../../lib -L/lib/../lib -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1/../../.. -lc -lm -lpthread -lgfortran -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1 -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1/../../../../lib -L/lib/../lib -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1/../../.. -lgfortran -lm -lquadmath -lm -lc -lm

System:
CPU: Intel Core i7-6700K with frequency locked at 4 GHz
Memory: 32 GB
OS: Arch Linux, 64-bit

Benoit Jacob

Jan 4, 2017, 10:26:20 AM

to Ziye Fan, gemmlowp

Hello,

gemmlowp has only been optimized for mobile CPUs so far; that is why its x86-optimized paths only take advantage of SSE4, not AVX, AVX2, etc. As a result, libraries optimized for desktop (non-mobile) x86 hardware should perform much better than gemmlowp there.
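
As a rough illustration of how much vector width alone can move the ceiling, here is a back-of-the-envelope sketch of per-core peak throughput for single-precision float. The inputs are assumptions, not measurements: a 4 GHz clock (from the system description above), two vector execution units per cycle, 8 float lanes with fused multiply-add for 256-bit AVX2, and 4 float lanes without FMA for 128-bit SSE. gemmlowp's kernels operate on 8-bit integers, so their real ceiling differs; the function name `peak_gops` is hypothetical.

```cpp
#include <cassert>

// Theoretical per-core peak in Gop/s:
// clock (GHz) * vector instructions per cycle * lanes per vector
// * arithmetic operations per lane per instruction.
double peak_gops(double ghz, int units_per_cycle, int lanes, int ops_per_lane) {
  assert(ghz > 0.0 && units_per_cycle > 0 && lanes > 0 && ops_per_lane > 0);
  return ghz * units_per_cycle * lanes * ops_per_lane;
}

// Assumed float ceiling with 256-bit AVX2 + FMA (multiply and add = 2 ops).
double assumed_avx2_fma_peak() { return peak_gops(4.0, 2, 8, 2); }

// Assumed float ceiling with 128-bit SSE and no FMA (1 op per instruction).
double assumed_sse_peak() { return peak_gops(4.0, 2, 4, 1); }
```

Under these assumptions the AVX2 ceiling is four times the SSE one (128 versus 32 Gop/s per core). The single-threaded OpenBLAS figure of 108.6 GFlops/s for 1000x1000x1000 would then sit at roughly 85% of the assumed AVX2 peak, which is consistent with the linked OpenBLAS build being the Haswell variant.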

Benoit
