Compiler optimizations / flags; Intel Compiler


Joe Duarte

Jul 1, 2016, 9:56:54 PM
to ngx-pagespeed-discuss
Hi all,

Whenever I see build instructions for nginx, including those for ngx_pagespeed, a vanilla make command is used and compiler optimization flags are not discussed. Has anyone tested different compiler optimizations to see if, and to what extent, they impact the runtime performance of the server and of pagespeed?

I'm thinking of things like -march=westmere to set the assumed instruction set to the Westmere generation of CPUs and above. I think this is necessary to use Cloudflare's optimized gzip fork, for example (because of the carry-less multiplication instruction set, which can also be directly specified with -mpclmul). See the GCC doc here.
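As a quick sanity check before pinning -march, the kernel exposes the relevant CPU feature bits. A minimal sketch (the feature name is the real /proc/cpuinfo spelling; the rest is just illustration):

```shell
# Does the build host's CPU advertise carry-less multiply (what
# -mpclmul / -march=westmere rely on)? On Linux the kernel lists
# the feature as "pclmulqdq" in /proc/cpuinfo.
if grep -qw pclmulqdq /proc/cpuinfo 2>/dev/null; then
  pclmul=available
else
  pclmul=missing
fi
echo "pclmul is $pclmul on this host"
```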

Then there are the platform-agnostic optimizations here, like -fpeephole2, -fprefetch-loop-arrays and many more. Since an HTTP server has some simple and extremely repetitive workloads, I'm guessing peephole optimizations would help. Same drill for pagespeed in particular. I don't know what's in the default make.config on common platforms, but it's probably what vanilla nginx builds are using.

The h2o web server is making profitable use of the SSE 4.2 string instructions (well, it's the picohttpparser specifically). I don't think nginx or pagespeed can benefit from those instructions automatically – compilers aren't that good yet. But there are other potential SIMD wins to be had from compiler flags.

Finally, anyone try the Intel Compiler on nginx + pagespeed? It generally beats other compilers for various applications, but I haven't found much on the web re: nginx builds, or pagespeed. Microsoft just rolled out a way to compile C++ (and presumably C?) apps on or for Linux using Visual Studio but I haven't dug into it yet.

My hunch is that if we're not using compiler optimizations or telling the compiler that the target is post-2009 silicon, we're leaving some nginx and pagespeed performance on the table. Pagespeed is a very active module, so there might be some headroom here for optimization (or, Google engineers already wrung out a lot of optimization).

If you've tried different compiler settings, I'd love to hear about the results. (I also read somewhere that default nginx builds we're using all the security / hardening options that they should, but I can't find it. Every post or gist on hardening nginx I've ever seen skips the compiler entirely, and goes straight to config options...)

Cheers,

Joe

Joe Duarte

Jul 2, 2016, 1:16:47 AM
to ngx-pagespeed-discuss


On Friday, July 1, 2016 at 6:56:54 PM UTC-7, Joe Duarte wrote:
Hi all,

If you've tried different compiler settings, I'd love to hear about the results. (I also read somewhere that default nginx builds we're using all the security / hardening options that they should, but I can't find it. Every post or gist on hardening nginx I've ever seen skips the compiler entirely, and goes straight to config options...)


Typo: It should read "...default nginx builds aren't using all the security / hardening options that they should..." (This is about the fortify-source options, PIE and/or PIC (I don't remember the difference, but they're both ASLR-related), and other flags.)
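For concreteness, the hardening options being alluded to look something like this on a GCC/Linux build. This is a sketch, not a vetted production recipe (roughly: PIC is what shared libraries are compiled with, PIE is the executable flavor that lets the kernel apply ASLR to the binary itself):

```shell
# Hypothetical hardened nginx configure invocation:
#   -D_FORTIFY_SOURCE=2       fortified libc calls; needs -O1 or higher
#   -fstack-protector-strong  stack canaries on a broader set of functions
#   -fPIE + -pie              position-independent executable, so ASLR
#                             can randomize the binary's load address
#   -Wl,-z,relro -Wl,-z,now   read-only relocations, no lazy binding
./configure \
  --with-cc-opt="-O2 -D_FORTIFY_SOURCE=2 -fstack-protector-strong -fPIE" \
  --with-ld-opt="-pie -Wl,-z,relro -Wl,-z,now"
```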

JD 

Hans van Eijsden

Jul 2, 2016, 5:28:02 PM
to ngx-pagespeed-discuss
Hi Joe,

I compile nginx with -O3, -march=native and -flto. My gcc version is 4.9.2 (Debian Jessie) and those three options give me a huge performance increase and a big CPU-usage decrease.

My full configure command:

NPS_VERSION=1.11.33.2
./configure --add-module=/usr/local/src/ngx_brotli_module --add-module=$HOME/ngx_pagespeed-release-${NPS_VERSION}-beta --prefix=/opt/nginx --user=www-data --group=www-data --with-http_ssl_module --with-http_spdy_module --with-http_v2_module --with-openssl=/usr/local/src/openssl-1.0.2h --with-openssl-opt="enable-ec_nistp_64_gcc_128 threads" --with-md5=/usr/local/src/openssl-1.0.2h --with-md5-asm --with-sha1=/usr/local/src/openssl-1.0.2h --with-sha1-asm --with-pcre-jit --with-file-aio --with-http_flv_module --with-http_geoip_module --with-http_gzip_static_module --with-http_gunzip_module --with-http_mp4_module --with-http_realip_module --with-http_stub_status_module --with-threads --with-ipv6 --with-cc-opt="-DTCP_FASTOPEN=23 -O3 -march=native -flto" --with-ld-opt="-DTCP_FASTOPEN=23 -O3 -march=native -flto"

I have tried icc, but my license has since expired. It gave me an extra 10% increase with nginx without ngx_pagespeed. I don't know if it's possible to compile ngx_pagespeed with icc, though.

- Hans


On Saturday, July 2, 2016 at 03:56:54 UTC+2, Joe Duarte wrote:

Otto van der Schaaf

Jul 2, 2016, 6:36:38 PM
to ngx-pagesp...@googlegroups.com
Hi Hans,

I'm curious, I remember you also used jemalloc in combination with ngx_pagespeed in the past.
How did that end up working for you? 

Otto




--
You received this message because you are subscribed to the Google Groups "ngx-pagespeed-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ngx-pagespeed-di...@googlegroups.com.
Visit this group at https://groups.google.com/group/ngx-pagespeed-discuss.
For more options, visit https://groups.google.com/d/optout.

Hans van Eijsden

Jul 3, 2016, 4:08:14 AM
to ngx-pagespeed-discuss
Good morning Otto,

You've got a good memory. Yes, it was me. 😉 

Yes, I used jemalloc on all of my ngx_pagespeed servers and it gave me great results: less memory usage, less memory fragmentation, shorter overall application response times, and everything feels more polished and smoother.
Currently I'm using ngx_pagespeed on only one of my servers, www.weblogzwolle.nl, because we love the image compression. I use jemalloc 4.2.1 with much pleasure.

I also added this rule to the [Service] part of /etc/systemd/system/nginx.service:
Environment="LD_PRELOAD=/usr/local/lib/libjemalloc.so"
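For anyone replicating this on a stock systemd unit, the usual way to add that line without editing the packaged unit file is a drop-in (the path and library location here mirror Hans's setup; adjust to taste):

```ini
# /etc/systemd/system/nginx.service.d/jemalloc.conf
# (run "systemctl daemon-reload && systemctl restart nginx" afterwards)
[Service]
Environment="LD_PRELOAD=/usr/local/lib/libjemalloc.so"
```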

- Hans


On Sunday, July 3, 2016 at 00:36:38 UTC+2, Otto van der Schaaf wrote:

Joe Duarte

Jul 4, 2016, 2:07:53 AM
to ngx-pagesp...@googlegroups.com
Hi Hans,

Awesome! I'm glad to know it makes a difference. I have a feeling that the -march=native and -flto flags have more of an impact than -O3, but there's only one way to find out...

By the way, what is the native arch in your case? Do you compile on your desktop or directly on the production server? I assume that this drives what the compiler treats as the native arch. I'm going to test some of this on VPS providers like DigitalOcean.

And have you tried any of the fast-math flags? It looks like they're all disabled by default, which means the default is strict, pedantic IEEE floating point. I'm not sure how much floating point comes up in typical nginx and pagespeed workloads. I was worried that messing with the FP math might somehow screw up OpenSSL, but given that an old Red Hat guide and WolfSSL both endorse the fast-math flags, I think it's probably okay to try for Open|Boring|LibreSSL. I'll report back when I have good data.

The other options I'm interested in are the miscellaneous increases-compile-time flags at the bottom of the GCC page (that's not what they're actually called). I don't care about compile time. I'm surprised we don't let compilers run for a week to get the most optimized (and perhaps safest) code for deployment of any given release of a major app. (I dream about a superoptimizing cloud compiler, something at least as powerful as IBM's Watson, that does in a few seconds what it would take GCC or MSVS a week to do. That would be so sweet. I don't know why compilers are assumed to be laptop-borne tools.) The feedback-driven optimizations down there are interesting too – I have a strong hunch that pagespeed will benefit from this.

Cheers,

Joe



Hans van Eijsden

Jul 4, 2016, 9:19:45 AM
to ngx-pagespeed-discuss
Hi Joe,

$ uname -a
Linux vps 4.6.0-0.bpo.1-amd64 #1 SMP Debian 4.6.1-1~bpo8+1 (2016-06-14) x86_64 GNU/Linux

$ cat /proc/cpuinfo 
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel Xeon E312xx (Sandy Bridge)
stepping : 1
microcode : 0x1
cpu MHz : 2399.998
cache size : 4096 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx lm constant_tsc nopl eagerfpu pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm xsaveopt
bugs :
bogomips : 4799.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
(4 CPU cores available)

I compile directly on the production servers (VPS @ Tilaa).
No, I haven't tried the fast-math flags (yet), because I'm not really that familiar with them and I didn't want to enter unknown territory on production servers yet. 😉 
Sounds interesting though!

We share exactly the same thoughts: I also don't care about compile time. Optimizing for the win!
Please report back when you have results.

- Hans


On Monday, July 4, 2016 at 08:07:53 UTC+2, Joe Duarte wrote:

Centmin Mod George

Aug 4, 2016, 11:02:18 AM
to ngx-pagespeed-discuss
I've been using clang and gcc with Intel-processor-optimized flags + jemalloc for a while now with my Centmin Mod LEMP stack, as I can switch between clang and gcc easily: https://community.centminmod.com/posts/33947/

CentOS 6.8 on a Linode with 4 CPUs (E5-2680 v3)


nginx -V
nginx version: nginx/1.11.3
built by gcc 4.9.1 20140922 (Red Hat 4.9.1-10) (GCC) 
built with LibreSSL 2.4.2
TLS SNI support enabled
configure arguments: --with-ld-opt='-lrt -ljemalloc -Wl,-z,relro -Wl,-rpath,/usr/local/lib' --with-cc-opt='-m64 -mtune=native -mfpmath=sse -g -O3 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2' --sbin-path=/usr/local/sbin/nginx --conf-path=/usr/local/nginx/conf/nginx.conf --with-http_stub_status_module --with-http_secure_link_module --with-openssl-opt=enable-tlsext --add-module=../nginx-module-vts --with-libatomic --with-threads --with-stream=dynamic --with-stream_ssl_module --with-http_gzip_static_module --add-dynamic-module=../ngx_brotli --add-dynamic-module=../ngx_pagespeed-release-1.11.33.2-beta --with-http_sub_module --with-http_addition_module --with-http_image_filter_module=dynamic --with-http_geoip_module=dynamic --with-stream_geoip_module=dynamic --with-http_realip_module --add-dynamic-module=../ngx-fancyindex-0.4.0 --add-module=../ngx_cache_purge-2.3 --add-module=../ngx_devel_kit-0.3.0 --add-module=../set-misc-nginx-module-0.30 --add-module=../echo-nginx-module-0.59 --add-module=../redis2-nginx-module-0.13 --add-module=../ngx_http_redis-0.3.7 --add-module=../memc-nginx-module-0.17 --add-module=../srcache-nginx-module-0.31 --add-module=../headers-more-nginx-module-0.30 --with-pcre=../pcre-8.39 --with-pcre-jit --with-http_ssl_module --with-http_v2_module --with-openssl=../libressl-2.4.2

Irwin L

Sep 29, 2016, 12:53:36 AM
to ngx-pagespeed-discuss
_common_flags=(
  --with-ipv6
  --with-pcre-jit
  --with-file-aio
  --with-http_addition_module
  --with-http_auth_request_module
  --with-http_dav_module
  --with-http_degradation_module
  --with-http_flv_module
  --with-http_geoip_module
  --with-http_gunzip_module
  --with-http_gzip_static_module
  --with-http_mp4_module
  --with-http_realip_module
  --with-http_secure_link_module
  --with-http_ssl_module
  --with-http_stub_status_module
  --with-http_sub_module
  --with-http_v2_module
  --with-stream
  --with-stream_ssl_module
  --with-threads
)

_custom_flags=(
    --with-ld-opt='-lrt -ljemalloc -Wl,-z,relro'

    --with-cc-opt='-m64 -mtune=native -mfpmath=sse -g -O3 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2'
    --with-libatomic
    --with-openssl=../libressl
    --with-openssl-opt=enable-tlsext
    --add-module=../ngx_cache_purge
    --add-dynamic-module=../nginx-module-vts
    --add-dynamic-module=../ngx_pagespeed
    --add-dynamic-module=../ngx-fancyindex
)

    ./configure \
        --prefix=/etc/nginx \
        --conf-path=/etc/nginx/nginx.conf \
        --sbin-path=/usr/bin/nginx \
        --pid-path=/run/nginx.pid \
        --lock-path=/run/lock/nginx.lock \
        --user=http \
        --group=http \
        --http-log-path=/var/log/nginx/access.log \
        --error-log-path=/var/log/nginx/error.log \
        --http-client-body-temp-path=/var/lib/nginx/client-body \
        --http-proxy-temp-path=/var/lib/nginx/proxy \
        --http-fastcgi-temp-path=/var/lib/nginx/fastcgi \
        --http-scgi-temp-path=/var/lib/nginx/scgi \
        --http-uwsgi-temp-path=/var/lib/nginx/uwsgi \
        ${_common_flags[@]} \
        ${_mainline_flags[@]} \
        ${_custom_flags[@]}

gives me:
./configure: error: invalid option "-ljemalloc"

% locate jemalloc
/usr/bin/jemalloc-config
/usr/bin/jemalloc.sh
/usr/include/jemalloc
/usr/include/jemalloc/jemalloc.h
/usr/lib/libjemalloc.so
/usr/lib/libjemalloc.so.2
/usr/lib/libjemalloc_pic.a
/usr/lib/pkgconfig/jemalloc.pc
/usr/share/doc/jemalloc
/usr/share/doc/jemalloc/jemalloc.html
/usr/share/licenses/jemalloc
/usr/share/licenses/jemalloc/COPYING
/usr/share/man/man3/jemalloc.3.gz

Any idea what's up?

Irwin L

Sep 29, 2016, 3:09:38 AM
to ngx-pagespeed-discuss
to add:

ldconfig -p | grep jemal
        libjemalloc.so.2 (libc6,x86-64) => /usr/lib/libjemalloc.so.2
        libjemalloc.so (libc6,x86-64) => /usr/lib/libjemalloc.so

Irwin L

Sep 29, 2016, 4:19:14 AM
to ngx-pagespeed-discuss
I've sorted it out on my own. I didn't realize that when it's called via a bash function, you have to escape the quotes. It compiles well now.
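The underlying gotcha is worth spelling out, since it bites a lot of configure wrappers: an unquoted ${array[@]} expansion gets word-split on the spaces inside an element, so --with-ld-opt='-lrt -ljemalloc ...' arrives at ./configure as two separate arguments, and the stray "-ljemalloc" is what configure rejects. A minimal, hypothetical demonstration:

```shell
# Word splitting on an unquoted bash array expansion: the single
# element below contains a space, so it splits into two arguments
# unless the expansion is quoted.
flags=( --with-ld-opt='-lrt -ljemalloc' )

count_args() { echo "$#"; }

unquoted=$(count_args ${flags[@]})     # splits -> 2 arguments
quoted=$(count_args "${flags[@]}")     # stays  -> 1 argument
echo "unquoted=$unquoted quoted=$quoted"
```

Quoting the expansion as "${_custom_flags[@]}" keeps each element intact, so no extra escaping inside the array definition is needed.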

Joe Duarte

Oct 2, 2016, 7:14:26 AM
to ngx-pagespeed-discuss
Hey thanks George and Irwin. I hadn't even thought of jemalloc – interesting... I've also been curious about tcmalloc, which Google built – it looks good.

Both of you have -mfpmath=sse, but that's the default for 64-bit builds on x86-64, so you don't need it. Also, there's hardly any floating point math in nginx, and it's a good idea to use something like -funsafe-math-optimizations (which is more conservative than its name suggests) or -ffast-math (which is much more aggressive than its name suggests, more so than -funsafe...) to make sure that whatever FP operations arise don't slow things down. nginx only has one trivial float and a couple of doubles (I think a relative weight variable): https://github.com/nginx/nginx/search?utf8=%E2%9C%93&q=double

Pagespeed has no FP action that I could find. zlib has no FP in program code (just in some test code). OpenSSL has some FP, but I believe the unsafe optimizations are safe for OpenSSL based on the WolfSSL docs I read. I don't think any of the crypto rides on negative zeroes, ± Infinity, etc.

I don't trust -mtune=native because I don't know how reliable it is at detecting the environment it's in, and on VPSes like DigitalOcean or Vultr I'm not sure the virtual machine environment is stable with respect to which CPU instructions are available. They'd presumably be more likely to upgrade hardware than to go backwards, but I play it safe and use -march=westmere (which gives you all the SSE 4.2 stuff, AES, and carry-less multiplication) or -march=haswell (which my Vultr instances support). I don't want the compiler to conjure up any AVX code unless the CPU is at least Haswell, not Sandy or Ivy (where AVX was not that great), and Haswell also gives you FMA and BMI, which are starting to come into play.
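The conservative pick can be scripted from the advertised feature bits rather than hard-coded. Here is a sketch using the flags line from the Sandy Bridge cpuinfo quoted earlier in the thread (the feature names are the real /proc/cpuinfo spellings; the mapping to -march values is my own rough heuristic):

```shell
# Rough heuristic: map advertised CPU features to a safe -march baseline.
# cpu_flags is an abbreviated copy of the Sandy Bridge flags line above.
cpu_flags="fpu de pse tsc msr pae mce cx8 pclmulqdq ssse3 sse4_1 sse4_2 aes avx"

march=x86-64                            # safe fallback
case " $cpu_flags " in
  *" avx2 "*)      march=haswell  ;;    # AVX2/FMA/BMI era and later
  *" pclmulqdq "*) march=westmere ;;    # SSE 4.2 + AES + carry-less multiply
esac
echo "-march=$march"
```

For this CPU the script lands on westmere: AVX is present but AVX2 is not, so the Haswell baseline would be unsafe.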

Profile-guided optimization (PGO) has captured my interest recently. It doesn't seem like many people have played with it. Web servers are so important, and there are so many instances of them running, that I'm surprised we don't have shareable profiles for PGO, and much more effort at testing and tuning all the other compiler optimizations.

We really ought to be able to tell the compiler that it is compiling a web server – it's amazing how naive our compilers are, just completely in the dark as to our goals. We should be able to give it a YAML file that tells it that the HTTP request parser will be hit with 10,000 requests per second, along with a distribution of what the different request headers will look like, what the different requested resources will be, estimated DB queries and calls to Redis or something, and all the other operations we expect the server to engage in, even before we generate real-world runtime profiles. It should have some awareness of what a web server is and how it's exercised. I mean, it's 2016, man... We ought to have shareable PGO configs (and compiler annotations) all over the web, competing for best performance. It's probably not worth making these toolchain improvements for barbaric C programs – they'd be a natural fit with a clean-sheet programming language and paradigm. I'm doing some research on that end.
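For reference, GCC's PGO loop for a build like nginx's is only a few commands. The -fprofile-* flags are real GCC options; the load-testing tool and everything else here are placeholders:

```shell
# 1) Instrumented build: counters are dumped to .gcda files on exit.
./configure --with-cc-opt="-O3 -fprofile-generate" --with-ld-opt="-fprofile-generate"
make && make install

# 2) Exercise the server with representative traffic (tool is a placeholder),
#    then stop nginx cleanly so the profile data gets flushed.
wrk -t4 -c100 -d60s http://localhost/
nginx -s quit

# 3) Rebuild against the collected profiles (-fprofile-correction
#    smooths counters that raced across worker processes/threads).
make clean
./configure --with-cc-opt="-O3 -fprofile-use -fprofile-correction" \
            --with-ld-opt="-fprofile-use"
make && make install
```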

Cheers,

Joe