[LLVMdev] runtime performance benchmarking tools for clang

Jyoti

Oct 3, 2013, 4:22:12 AM

to llv...@cs.uiuc.edu, cfe-dev@cs.uiuc.edu Developers

Hi All,

Could anyone point me to some good benchmarking tools to measure the runtime performance of clang-compiled C++ applications?

Thanks !

- Jyoti

Kun Ling

Oct 3, 2013, 8:42:06 AM

to Jyoti, cfe-dev@cs.uiuc.edu Developers, LLVM Developers Mailing List

Hi Jyoti,

The best benchmark is your own application. Since Clang and LLVM apply plenty of aggressive optimizations (some of which may be bug-prone), the right choice also depends on what kind of performance you want to improve.

The following are some benchmarks that you could use to evaluate the performance of clang.

1. Phoronix has run performance tests using its Phoronix Test Suite (http://www.phoronix-test-suite.com/ ), which includes plenty of commonly used applications. The full list of applications in the suite can be found here: http://openbenchmarking.org/suites/pts

2. For industry standard performance comparison, SPEC CPU is also a good choice. You could find out more here: http://www.spec.org/cpu/ . General Purpose CPU vendors use it to show performance improvements.

3. There are also some smaller benchmarks that can test compiler performance, like polybench (http://www.cse.ohio-state.edu/~pouchet/software/polybench/ ), which focuses on evaluating the compiler's loop transformations.

Regards,

Kun Ling

--
http://www.lingcc.com

Jyoti

Dec 11, 2013, 9:53:50 AM

to C. Bergström, llv...@cs.uiuc.edu, cfe-dev@cs.uiuc.edu Developers

Hi Kun Ling & Bergstrom,

Thanks a lot for your earlier responses. We used the benchmarks in the llvm testsuite to compare the execution times of clang and gcc builds. Clang appears to be slower than gcc in cases that involve floating-point operations or recursive calls (note that pic/pie was enabled for both gcc and clang).

1) For the lag in execution time due to recursive calls, it was obvious that resolving dynamic relocations via .plt indirection added to the delay. However, it is not clear how gcc manages this in less time than clang, given that the same libc.so.6 and ld-linux.so.3 were used for the executables generated by both gcc and clang.

What could be the possible reason?

2) For the lag in execution time due to floating-point operations, we clearly observed that gcc used the floating-point instruction FSQRT, whereas clang seemed to call an emulated function (?) via BL SQRT.

Note that we used the following flags for both clang as well as gcc compilation.

-march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3-d16 -mtune=cortex-a8

In fact, I was surprised to see that even when " -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mtune=cortex-a8"

was used, the generated code did not use the hardware vsqrt instruction; instead there was a bl sqrt instruction.

Could someone point out why vsqrt was not emitted in the assembly even though the softfp or 'hard' float-abi was specified?

3) Could you suggest other benchmarks specifically for floating point, other than those in the llvm testsuite?

On Thu, Oct 3, 2013 at 6:43 PM, "C. Bergström" <cberg...@pathscale.com> wrote:

On 10/ 3/13 07:42 PM, Kun Ling wrote:

Hi Jyoti,

The best benchmark is your own application. Since Clang and LLVM apply plenty of aggressive optimizations (some of which may be bug-prone), the right choice also depends on what kind of performance you want to improve.

The following are some benchmarks that you could use to evaluate the performance of clang.

1. Phoronix has run performance tests using its Phoronix Test Suite (http://www.phoronix-test-suite.com/ ), which includes plenty of commonly used applications. The full list of applications in the suite can be found here: http://openbenchmarking.org/suites/pts

Hi LK

-1 :P Have you looked at their testsuite and how it's set up? It gives little regard to switching out compiler flags and tracking the performance of those changes.

2. For industry standard performance comparison, SPEC CPU is also a good choice. You could find out more here: http://www.spec.org/cpu/ . General Purpose CPU vendors use it to show performance improvements.

Waaay over tuned...

3. There are also some smaller benchmarks that can test compiler performance, like polybench (http://www.cse.ohio-state.edu/~pouchet/software/polybench/ ), which focuses on evaluating the compiler's loop transformations.

I can't say with absolute certainty, but didn't these favor polyhedral-type loop optimizations?
---------------------
You have to decide what types of code you want to benchmark - HPC, C++, scalar/vectorized.. embedded.. etc

If you narrow down what sort of performance comparison you want, I can offer some suggestions. The above benchmarks aren't bad, but in some cases they won't give a fair comparison against clang/llvm. Other compilers may have been tuned excessively for them, and that will show when compared to a default clang/llvm.

The NAS Parallel Benchmarks, for instance, probably have less direct tuning from Intel. If you're looking at embedded, maybe Dhrystone....

Lastly - There are some benchmarks in the llvm testsuite to consider
I can't say these are very good choices, but they are probably easy to run
https://llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/

David Peixotto

Dec 11, 2013, 12:58:10 PM

to Jyoti, cfe...@cs.uiuc.edu, llv...@cs.uiuc.edu

2) For the lag in execution time due to floating-point operations, we clearly observed that gcc used the floating-point instruction FSQRT, whereas clang seemed to call an emulated function (?) via BL SQRT.

Note that we used the following flags for both clang as well as gcc compilation.

-march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3-d16 -mtune=cortex-a8

In fact, I was surprised to see that even when " -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mtune=cortex-a8"

was used, the generated code did not use the hardware vsqrt instruction; instead there was a bl sqrt instruction.

Could someone point out why vsqrt was not emitted in the assembly even though the softfp or 'hard' float-abi was specified?

The vsqrt instruction may not be generated automatically on platforms where math functions may set errno. Try compiling with -fno-math-errno and see if that helps.

Jyoti

Dec 12, 2013, 2:12:06 AM

to David Peixotto, cfe-dev@cs.uiuc.edu Developers, llv...@cs.uiuc.edu

Hi David,

Thanks for your reply.

We enabled -ffast-math, which in turn passes -fno-math-errno to clang -cc1. As a result, the call to the sqrt function was replaced with the VSQRT instruction, and we saw an improvement of about 40% over the earlier numbers for some of the test cases.

A lag still exists when compared to gcc, though. We are currently investigating that. Any pointers in this direction would help.

Could you suggest some benchmarks specifically for floating point?

Thanks !

Jyoti Allur
