Compiler performance


Gabriel Hackebeil

Apr 23, 2015, 10:14:01 PM
to cryptop...@googlegroups.com
I've run the benchmark tests after compiling Crypto++ 5.6.2 with the latest versions of the Intel compilers (icpc 15.0.2), GNU g++ (4.9 and 5.1), and Clang (Apple LLVM version 6.1.0). I'm basically interested in getting the best possible performance for AES in CTR mode, and the results of the benchmarks were surprising. In my limited understanding, use of the AES-NI instruction set (which I believe only the Intel compiler can utilize) is supposed to provide a big performance boost for AES. The results do not show this. For the AES mode of interest (CTR), the throughput I achieve (as reported by the benchmark test suite) is roughly:

Clang: ~1 GiB/second
Intel: ~1.7 GiB/second
GNU: ~4.1 GiB/second

Can someone explain why GNU has such a huge boost in performance (~2.5x) over Intel, when GNU cannot use the AES-NI instructions (I don't care much about Clang)? I get the same results comparing Intel and GNU on a Linux VM. Let me know if you need any relevant machine specs (Intel Core i7 CPU). The relevant compiler flags on Linux vs. OS X are shown below (the performance results are the same on either operating system, and I've played with various optimization flags for each without much change in performance). Can anyone enlighten me about the lack of a performance boost from AES-NI?

On Linux VM:
GNU:
$ make CXX=g++
$ g++ -DNDEBUG -g -O2 -march=native -pipe -c ...

Intel:
$ make CXX=icpc
$ icpc -DNDEBUG -g -O2 -wd68 -wd186 -wd279 -wd327 -pipe ...

On OS X:
GNU: (the "-Wa,-q" is to get around assembler errors)
$ make CXX="g++-5.1 -Wa,-q"
$ g++-5.1 -Wa,-q -DNDEBUG -g -O2 -arch x86_64 -DCRYPTOPP_DISABLE_ASM -pipe ...

Intel:
$ make CXX=icpc
$ icpc -DNDEBUG -g -O2 -wd68 -wd186 -wd279 -wd327 -DCRYPTOPP_DISABLE_ASM -c ...

Gabriel Hackebeil

Apr 24, 2015, 1:00:55 AM
to cryptop...@googlegroups.com
My apologies. I was incorrect in assuming only the Intel compiler was using the AES-NI intrinsics. I ran some code profiling on OS X and it showed that both the GNU and the Intel compilers were calling the AES-NI subroutines (Clang was the only compiler that failed to do so).

So to slightly rephrase my question... What could be going on that's causing GNU to outperform Intel so dramatically (any optimization flags to consider)? Also, has anyone else encountered such an unbalanced performance outcome between two compilers on the same machine (both utilizing AES-NI)?

Regards,
Gabe

Mobile Mouse

Apr 24, 2015, 10:09:46 AM
to Gabriel Hackebeil, cryptop...@googlegroups.com
You need to explicitly specify "-maes -mpclmul" for clang to use AES-NI and PCLMUL instructions.

Sent from my iPad
--
--
You received this message because you are subscribed to the "Crypto++ Users" Google Group.
To unsubscribe, send an email to cryptopp-user...@googlegroups.com.
More information about Crypto++ and this group is available at http://www.cryptopp.com.
---
You received this message because you are subscribed to the Google Groups "Crypto++ Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cryptopp-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gabriel Hackebeil

Apr 24, 2015, 3:38:44 PM
to cryptop...@googlegroups.com, gabe...@gmail.com
Unfortunately, adding those flags alone isn't enough to get Crypto++ to use the AES-NI intrinsics when Clang is used as the compiler. Mainly, the macro check for (CRYPTOPP_GCC_VERSION >= 40400 || _MSC_FULL_VER >= 150030729 || __INTEL_COMPILER >= 1110) fails, so CRYPTOPP_BOOL_AESNI_INTRINSICS_AVAILABLE never gets activated. Hacking around that is easy enough, but compilation ends up dying a terrible death like what I'm including at the end of this email. Thus, I think I give up on clang++ for this library.

Regards,
Gabe

clang++ -DNDEBUG -O2 -gdwarf-2 -maes -mpclmul -pipe -c camellia.cpp
In file included from camellia.cpp:16:
./cpu.h:33:64: warning: unknown attribute '__artificial__' ignored [-Wunknown-attributes]

__inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
                                                               ^
./cpu.h:40:68: warning: unknown attribute '__artificial__' ignored [-Wunknown-attributes]
__inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
                                                                   ^
./cpu.h:37:7: error: invalid operand for inline asm constraint 'i'
        asm ("pextrd %2, %1, %0" : "=rm"(r) : "x"(a), "i"(i));
             ^
./cpu.h:43:7: error: invalid operand for inline asm constraint 'i'
        asm ("pinsrd %2, %1, %0" : "+x"(a) : "rm"(b), "i"(i));
             ^
Stack dump:
0.    Program arguments: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang -cc1 -triple x86_64-apple-macosx10.10.0 -emit-obj -disable-free -disable-llvm-verifier -main-file-name camellia.cpp -mrelocation-model pic -pic-level 2 -mdisable-fp-elim -masm-verbose -munwind-tables -target-cpu core2 -target-feature +aes -target-feature +pclmul -target-linker-version 242 -gdwarf-2 -dwarf-column-info -coverage-file /Users/ghackebeil/Courses/AppliedCrypto_CS519_W2015/DynamicSearchableEncryption/DSE_cpp/Thirdparty/cryptopp562/camellia.o -resource-dir /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/clang/6.1.0 -D NDEBUG -stdlib=libc++ -O2 -fdeprecated-macro -fdebug-compilation-dir /Users/ghackebeil/Courses/AppliedCrypto_CS519_W2015/DynamicSearchableEncryption/DSE_cpp/Thirdparty/cryptopp562 -ferror-limit 19 -fmessage-length 130 -stack-protector 1 -mstackrealign -fblocks -fobjc-runtime=macosx-10.10.0 -fencode-extended-block-signature -fcxx-exceptions -fexceptions -fmax-type-align=16 -fdiagnostics-show-option -fcolor-diagnostics -vectorize-loops -vectorize-slp -o camellia.o -x c++ camellia.cpp
1.    <eof> parser at end of file
2.    Code generation
3.    Running pass 'Function Pass Manager' on module 'camellia.cpp'.
4.    Running pass 'Simple Register Coalescing' on function '@_Z16_mm_insert_epi32Dv2_xii'
clang: error: unable to execute command: Segmentation fault: 11
clang: error: clang frontend command failed due to signal (use -v to see invocation)
Apple LLVM version 6.1.0 (clang-602.0.49) (based on LLVM 3.6.0svn)
Target: x86_64-apple-darwin14.3.0
Thread model: posix
clang: note: diagnostic msg: PLEASE submit a bug report to http://developer.apple.com/bugreporter/ and include the crash backtrace, preprocessed source, and associated run script.
clang: note: diagnostic msg:
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang: note: diagnostic msg: /var/folders/p5/4dj9d59x27v7_5_ddgn8r8zm0000gn/T/camellia-031900.cpp
clang: note: diagnostic msg: /var/folders/p5/4dj9d59x27v7_5_ddgn8r8zm0000gn/T/camellia-031900.sh
clang: note: diagnostic msg:

********************
make: *** [camellia.o] Error 254

Gabriel Hackebeil

Apr 24, 2015, 4:07:07 PM
to cryptop...@googlegroups.com, gabe...@gmail.com
Mouse,

I just came across your post from a few months ago about this exact compilation issue (https://groups.google.com/forum/#!searchin/cryptopp-users/Cannot$20compile$20crypto$2B$2B$20%28camellia%29/cryptopp-users/TfkVPnljrzg/DwbHIcbD8O4J). Trying out your patch now.

Gabriel Hackebeil

Apr 24, 2015, 5:50:20 PM
to cryptop...@googlegroups.com, gabe...@gmail.com
The patch worked great, and clang performance is now pretty much on par with gcc. It was also necessary to add "-msse4" to the compile flags after applying your patch. The relevant compile scenarios are now:

$ g++-5.1 -Wa,-q -DNDEBUG -O2 -arch x86_64 -DCRYPTOPP_DISABLE_ASM -pipe -c ...
$ clang++ -DNDEBUG -O2 -maes -mpclmul -msse4 -DCRYPTOPP_DISABLE_ASM -pipe -c ...
$ icpc -DNDEBUG -O2 -DCRYPTOPP_DISABLE_ASM -pipe -c ...

Intel:    ~1.7 GiB/second
Clang: ~3.5  GiB/second
GNU:   ~4.1 GiB/second

I still find it surprising that the Intel compiler is being left in the dust here. Maybe Intel is trying to tell me I need to directly use the Intel Performance Primitives Crypto Library for fast AES with their compiler.

Jeffrey Walton

Apr 26, 2015, 6:23:15 PM
to cryptop...@googlegroups.com


On Thursday, April 23, 2015 at 10:14:01 PM UTC-4, Gabriel Hackebeil wrote:
I've run the benchmark tests after compiling Crypto++ 5.6.2 with the latest versions of the Intel compilers (icpc 15.0.2), GNU g++ (4.9 and 5.1), and Clang (Apple LLVM version 6.1.0). I'm basically interested in getting the best possible performance for AES in CTR mode, and the results of the benchmarks were surprising. In my limited understanding, use of the AES-NI instruction set (which I believe only the Intel compiler can utilize) is supposed to provide a big performance boost for AES. The results do not show this. For the AES mode of interest (CTR), the throughput I achieve (as reported by the benchmark test suite) is roughly:

Clang: ~1 GiB/second
Intel: ~1.7 GiB/second
GNU: ~4.1 GiB/second

Can someone explain why GNU has such a huge boost in performance (~2.5x) over Intel, when GNU cannot use the AES-NI instructions (I don't care much about Clang)? I get the same results comparing Intel and GNU on a Linux VM. Let me know if you need any relevant machine specs (Intel Core i7 CPU). The relevant compiler flags on Linux vs. OS X are shown below (the performance results are the same on either operating system, and I've played with various optimization flags for each without much change in performance). Can anyone enlighten me about the lack of a performance boost from AES-NI?

It may depend on your processor. Are you using an Intel processor, or an AMD (or other) processor?

Intel has been known to do sneaky things, like providing sub-optimal code paths on non-Intel processors. Way back when Intel was caught doing this, they did not stop doing it. Rather, they informed developers they were doing it to sidestep consumer protection laws. See, for example, http://www.agner.org/optimize/blog/read.php?i=49#49 or https://www.google.com/search?q=intel+bad+code+on+amd.

Gabriel Hackebeil

Apr 26, 2015, 8:53:59 PM
to Jeffrey Walton, cryptop...@googlegroups.com
Definitely using an Intel processor (2.6 GHz Intel Core i7), and I heard about this as well. That’s what makes it so surprising.

Gabe
