About performance


Ziye Fan

Dec 30, 2016, 7:29:07 AM
to gemmlowp
Dear developers,

Thank you for developing the gemmlowp library. I tested gemmlowp on my computer and, for comparison, tested OpenBLAS sgemm under the same conditions. gemmlowp appears to be much slower than OpenBLAS; the detailed results follow. Is this the performance I should expect?

Thank you very much.
Ziye

To test OpenBLAS's GEMM under the same conditions as gemmlowp, I modified the function 'time_for_gemms' in benchmark.cc as follows:
  while (true) {
    double starttime = time();
    for (int i = 0; i < iters_at_a_time; i++) {
      for (size_t j = 0; j < gemms.size(); j++) {
        blasint m = gemms[j].rows;
        blasint n = gemms[j].depth;
        blasint k = gemms[j].cols;
        // new_a, new_b, new_c are buffers large enough.
        GEMM (&trans, &trans, &m, &n, &k, alpha, new_a, &m, new_b, &k, beta, new_c, &m);
      }
    }
    double endtime = time();
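
For readers comparing the GFlops/s figures in the results, here is a minimal sketch of how a timing is usually turned into a throughput number. It assumes the conventional count of 2*rows*depth*cols operations per GEMM; the function name `gemm_gflops` is hypothetical and is not taken from benchmark.cc. The count is symmetric in the three dimensions, so it does not depend on which of them is passed to GEMM as n or k.

```cpp
#include <cassert>
#include <cstdint>

// Convert one GEMM's dimensions and wall-clock time into GFlop/s.
// Assumption: each of the rows*depth*cols inner-product terms costs
// one multiply and one add, i.e. 2 operations.
double gemm_gflops(std::int64_t rows, std::int64_t depth, std::int64_t cols,
                   double seconds) {
  assert(seconds > 0.0);  // a zero or negative duration is a measurement error
  const double ops = 2.0 * rows * depth * cols;
  return ops / seconds / 1e9;
}
```

With this convention, a 1000x1000x1000 GEMM that takes one second counts as 2 GFlop/s.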


Test results for OpenBLAS:
Benchmarking small model GEMMs...
running for 10 seconds...
Graph latency (over 47 iterations):
  Best:             0.00642519s
  Worst:            0.00739394s
  Mean:             0.00682319s
  25% trimmed mean: 0.00677253s
  Mean of 10% best: 0.00646243s
Benchmarking typical GoogLeNet GEMMs...
running for 20 seconds...
Graph latency (over 70 iterations):
  Best:             0.0172201s
  Worst:            0.0460801s
  Mean:             0.0195106s
  25% trimmed mean: 0.0185641s
  Mean of 10% best: 0.0174235s
Benchmarking default mode (typically multi-threaded)...
10x10x10 : 8.28 GFlops/s
20x20x20 : 22.53 GFlops/s
30x30x30 : 25.91 GFlops/s
40x40x40 : 41.56 GFlops/s
50x50x50 : 41.92 GFlops/s
60x60x60 : 49.6 GFlops/s
64x256x147 : 116.1 GFlops/s
100x100x1 : 5.072 GFlops/s
100x100x100 : 75.89 GFlops/s
100x1000x100 : 103.4 GFlops/s
1000x1000x1 : 9.682 GFlops/s
1000x1000x10 : 92.16 GFlops/s
1000x1000x100 : 276.1 GFlops/s
1000x1000x1000 : 292.7 GFlops/s

Benchmarking single-threaded mode...
10x10x10 : 8.266 GFlops/s
20x20x20 : 22.48 GFlops/s
30x30x30 : 25.93 GFlops/s
40x40x40 : 41.48 GFlops/s
50x50x50 : 42.11 GFlops/s
60x60x60 : 49.61 GFlops/s
64x256x147 : 69.63 GFlops/s
100x100x1 : 5.066 GFlops/s
100x100x100 : 66.96 GFlops/s
100x1000x100 : 75.58 GFlops/s
1000x1000x1 : 4.964 GFlops/s
1000x1000x10 : 35.86 GFlops/s
1000x1000x100 : 94.5 GFlops/s
1000x1000x1000 : 108.6 GFlops/s

Test results for gemmlowp:

Benchmarking small model GEMMs...
running for 10 seconds...
Graph latency (over 36 iterations):
  Best:             0.00771596s
  Worst:            0.0113729s
  Mean:             0.00852033s
  25% trimmed mean: 0.00830705s
  Mean of 10% best: 0.00787207s
Benchmarking typical GoogLeNet GEMMs...
running for 20 seconds...
Graph latency (over 68 iterations):
  Best:             0.0329185s
  Worst:            0.0527048s
  Mean:             0.0369283s
  25% trimmed mean: 0.0363602s
  Mean of 10% best: 0.0336597s
Benchmarking default mode (typically multi-threaded)...
10x10x10 : 3.613 GFlops/s
20x20x20 : 8.814 GFlops/s
30x30x30 : 14.39 GFlops/s
40x40x40 : 17.58 GFlops/s
50x50x50 : 17.92 GFlops/s
60x60x60 : 36.2 GFlops/s
64x256x147 : 76.44 GFlops/s
100x100x1 : 3.645 GFlops/s
100x100x100 : 43.48 GFlops/s
100x1000x100 : 85.67 GFlops/s
1000x1000x1 : 15.48 GFlops/s
1000x1000x10 : 87.93 GFlops/s
1000x1000x100 : 143.8 GFlops/s
1000x1000x1000 : 156.5 GFlops/s

Benchmarking single-threaded mode...
10x10x10 : 3.612 GFlops/s
20x20x20 : 8.811 GFlops/s
30x30x30 : 14.33 GFlops/s
40x40x40 : 17.59 GFlops/s
50x50x50 : 17.96 GFlops/s
60x60x60 : 26.4 GFlops/s
64x256x147 : 34.18 GFlops/s
100x100x1 : 3.606 GFlops/s
100x100x100 : 28.54 GFlops/s
100x1000x100 : 38.25 GFlops/s
1000x1000x1 : 3.077 GFlops/s
1000x1000x10 : 13.71 GFlops/s
1000x1000x100 : 19.24 GFlops/s
1000x1000x1000 : 19.68 GFlops/s
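
To put the single-threaded figures side by side, the following sketch divides the OpenBLAS throughput by the gemmlowp throughput for three of the reported shapes. The `Row` struct and the function names are illustrative only and belong to neither library; the numbers are copied from the results above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One benchmark shape with the single-threaded GFlops/s reported for each library.
struct Row {
  std::string shape;
  double openblas;
  double gemmlowp;
};

// Ratio of OpenBLAS throughput to gemmlowp throughput for one shape.
double speedup(const Row& r) {
  assert(r.gemmlowp > 0.0);
  return r.openblas / r.gemmlowp;
}

// Three rows taken from the single-threaded results in this post.
std::vector<Row> single_threaded_rows() {
  return {
      {"100x100x100", 66.96, 28.54},
      {"1000x1000x100", 94.5, 19.24},
      {"1000x1000x1000", 108.6, 19.68},
  };
}
```

Calling `speedup` on each row gives roughly 2.3x, 4.9x and 5.5x in favor of OpenBLAS, so the gap widens as the matrices grow.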

Compilation Switches:
  • gemmlowp:
$ c++ benchmark_gemmlowp.cc -O3 -I../gemmlowp/test -msse4.1 -lpthread -o benchmark_gemmlowp.out
  • openblas:
$ g++ -O2 -msse4.1 -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=8 -DASMNAME=sgemm -DASMFNAME=sgemm_ -DNAME=sgemm_ -DCNAME=sgemm -DCHAR_NAME=\"sgemm_\" -DCHAR_CNAME=\"sgemm\" -DNO_AFFINITY -I../gemmlowp/test -I../OpenBLAS -c -UCOMPLEX -UDOUBLE -o sgemm.o benchmark_openblas.cc
 
$ g++ -O2 -msse4.1 -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=8 -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I../gemmlowp/test -I.. -o benchmark_gemmgoto.out sgemm.o ../OpenBLAS/libopenblas_haswellp-r0.2.20.dev.a -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1 -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1/../../../../lib -L/lib/../lib -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1/../../..  -lc   -lm -lpthread -lgfortran -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1 -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1/../../../../lib -L/lib/../lib -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-pc-linux-gnu/6.2.1/../../..  -lgfortran -lm -lquadmath -lm -lc   -lm

System:
CPU: Intel Core i7-6700K with frequency locked at 4 GHz
Memory: 32 GB
OS: Arch Linux, 64-bit

Benoit Jacob

Jan 4, 2017, 10:26:20 AM
to Ziye Fan, gemmlowp
Hello,

gemmlowp has only been optimized for mobile CPUs so far, which is why its x86-optimized paths take advantage only of SSE4, not of AVX, AVX2, etc. As a result, other libraries optimized for desktop (non-mobile) x86 hardware should perform much better than gemmlowp there.
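
As a rough illustration of the instruction-set gap described above, the sketch below asks the running CPU which of the SIMD levels named in this thread it supports. It relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, which are available only on x86 targets; on any other compiler or architecture it falls back to "baseline". The function names are hypothetical and are not part of gemmlowp or OpenBLAS.

```cpp
#include <cassert>
#include <string>

// Widest SIMD level, among those mentioned in this thread, that the
// running CPU reports. Assumes GCC or Clang on x86; otherwise "baseline".
std::string widest_simd_level() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) return "avx2";
  if (__builtin_cpu_supports("avx")) return "avx";
  if (__builtin_cpu_supports("sse4.1")) return "sse4.1";
#endif
  return "baseline";
}

// Vector register width in bits for each level: SSE uses 128-bit XMM
// registers, AVX and AVX2 use 256-bit YMM registers.
int simd_register_bits(const std::string& level) {
  if (level == "avx2" || level == "avx") return 256;
  if (level == "sse4.1") return 128;
  assert(level == "baseline");
  return 0;
}
```

On a CPU that reports "avx2", a kernel limited to SSE4 works with registers half as wide as the hardware offers, which is consistent with the explanation above. Note that this only reports what the CPU supports, not which code path a given library actually selects.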

Benoit
