
Resource usage when compiling on Linux


Gregory Szorc

Jan 26, 2012, 9:41:15 PM
tl;dr Clang seems to use 15% less CPU than GCC on Linux.

For a while now, I have wanted to put some numbers behind build times on
GCC vs Clang. I had also wanted to play around with Linux Control Groups
(cgroups). So, I tackled both at the same time and produced some
interesting numbers.

The following numbers all come from an Ubuntu 11.10 64-bit OS running
under a VMware VM on top of Windows 7 64-bit. The VM is allocated 4 of
the host's 16GB of physical memory and 4 of 8 cores on a Core i7-2600K,
and is backed by a 3-year-old Western Digital 7200RPM HD. I was using
GCC 4.6.1 and LLVM/Clang compiled from r148652 (last Sunday) of the
LLVM/Clang SVN repository.

My methodology was to create a new cgroup, perform a build action in
that cgroup, then look at that cgroup's stats. If you aren't familiar
with how cgroups work, they are execution sandboxes supported at the
kernel level. They are useful for setting resource access priorities
and limits, and you effectively get process accounting for free. All
builds were from an m-c pulled today with --enable-tests --enable-debug
--enable-optimize (yes, I know that is funky). I ran the baseline
clobber builds twice to ensure consistency in the numbers.
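For anyone who wants to reproduce this, the accounting side boils down to reading a couple of files after the build finishes. Here is a minimal sketch, assuming the cgroup v1 layout (e.g. /sys/fs/cgroup/cpuacct/<name>/cpuacct.usage); the file contents shown are illustrative:

```python
# Parse the two cgroup v1 accounting files used for these measurements.

def parse_cpuacct_usage(text):
    """cpuacct.usage holds cumulative CPU time in nanoseconds."""
    return int(text.strip()) / 1e9  # seconds


def parse_blkio_service_bytes(text):
    """Return the aggregate 'Total' line of blkio.io_service_bytes.

    Per-device lines look like '8:0 Read 1048576'; the file ends with a
    two-field summary line: 'Total <bytes>'.
    """
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "Total":
            return int(parts[1])
    return 0


# With made-up file contents:
cpu_seconds = parse_cpuacct_usage("3745000000000\n")  # -> 3745.0
total_bytes = parse_blkio_service_bytes(
    "8:0 Read 1048576\n8:0 Write 524288\nTotal 1572864\n")  # -> 1572864
```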

For clobber builds (no objdir, no ccache), GCC spent 3745s (62:25) on
the CPU. Clang spent 3205s (53:25), a difference of 540s (9:00).
Interestingly, wall times were very similar: 22:50 for GCC and 20:49 for
Clang, a difference of 2:01, or ~30s per core. If we assume nothing else
was competing for the CPU, GCC used 69% of available CPU resources over
the wall time and Clang used 64%.
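The "effective CPU" figures are just CPU time divided by wall time across the available cores. A quick check of the arithmetic (the 4-core count is the VM configuration described above):

```python
def effective_cpu(cpu_seconds, wall_seconds, cores=4):
    """Fraction of available CPU actually used over the wall time."""
    return cpu_seconds / (wall_seconds * cores)


gcc_util = effective_cpu(3745, 22 * 60 + 50)    # ~0.68, the ~69% figure
clang_util = effective_cpu(3205, 20 * 60 + 49)  # ~0.64
```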

If we turn on ccache and start with an empty cache, we see resource
usage increase. Clang used 3432s (57:12) CPU time, an increase of 227s
(3:47). Wall time increased to 23:28, an increase of 159s (2:39).
Effective CPU utilization decreased from 64% to 61%.

If we perform a clobber build on Clang with a recently populated ccache,
we see expected build time speed improvements. CPU time drops to a
staggeringly low 344s (5:44), a reduction of 2861s (47:41)! However,
wall time is only reduced to 12:36, 493s (8:13) less than the standard
clobber build. Effective CPU utilization is reduced to 11.4%!

What about I/O? Well, cgroups only report device-level counters. So,
you miss things that are serviced by the page cache. And writes don't
seem to be directly recorded because they are flushed out by a process
not running in the cgroup. Oh DTrace, wherefore art thou DTrace! So, the
numbers are practically worthless. But, if you must hear something,
clobber builds seemed to do 1.75GB-2.7GB of device I/O
(blkio.io_service_bytes). Each build used about a solid minute of
allocated device time through the I/O scheduler. Somewhere between 3.4M
and 5.3M sectors (blkio.sectors) were transferred. Service time
(blkio.io_service_time) was between 7:00 and 9:00. Total wait time was
in the 2:30 range.

If you throw building from a populated ccache into the mix, device time,
IO ops, and service bytes effectively double. Total wait time goes
through the roof to ~35 minutes (this is the sum of all wait times for
all queued I/O ops).

FWIW, peak memory usage was ~3520MB (the same linker was used on both
GCC and Clang). It peaks during linking of libxul. This is also when
device I/O went through the roof. During the rest of the build, my page
cache seemed to handle things quite nicely.

About the only semi-solid takeaway I have is that Clang seems to use
~15% less CPU than GCC. cgroup-reported CPU usage across different
builds was consistent and the only piece that changed was the compiler.

I'm not going to speculate on the relationship between CPU usage and
I/O/page-cache behavior. But, it would certainly be interesting to fiddle with the CPU and
I/O resource throttling features of cgroups and see what effects they
have on builds!

I'm also really curious to put a number to the total bytes flowing
through the I/O subsystem during builds as well as the total number of
memory allocations. But, I'm not sure I can get these from Linux easily.
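(For the total-bytes question, modern kernels do expose per-process I/O counters in /proc/<pid>/io: rchar/wchar count all bytes passed through read()/write() syscalls, page cache hits included, while read_bytes/write_bytes count traffic that actually hit storage. A sketch of a parser for that file's format, with made-up sample contents:)

```python
def parse_proc_io(text):
    """Parse /proc/<pid>/io contents into a dict of counters."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value.strip().isdigit():
            stats[key.strip()] = int(value)
    return stats


sample = "rchar: 323934931\nwchar: 323929600\nread_bytes: 4096\nwrite_bytes: 0\n"
io = parse_proc_io(sample)
# io["rchar"] counts all read() bytes (page-cache hits included);
# io["read_bytes"] counts bytes that actually came from the device.
```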

I hope others find this data interesting.

Greg

Rafael Ávila de Espíndola

Jan 27, 2012, 9:33:34 AM
to dev-b...@lists.mozilla.org
> I hope others find this data interesting.

It is, thanks!

btw, which linker are you using? Does gold help?

Given the amount of I/O, I would expect the debug info size of the .o
files to be one of the areas where an improvement in clang would have
the most impact.

> Greg

Cheers,
Rafael

Gregory Szorc

Jan 27, 2012, 1:03:15 PM
to Rafael Ávila de Espíndola
On 1/27/2012 6:33 AM, Rafael Ávila de Espíndola wrote:
> btw, which linker are you using? Does gold help?

I'm using the standard GNU linker. I haven't tried gold.

> Given the amount of IO, I would expect that the debug size on the .o
> files to be one of the areas where an improvement in clang would have
> the most impact.

I didn't state it explicitly in my original post because I wanted to
focus on concrete numbers, but it certainly looked like most of the
device I/O occurred during libxul link. Throughout most of the build, I
would only see the device doing 50-100KB/s total I/O. Every few seconds,
writes would burst out to a few MB/s. But, during libxul link, device
read I/O would sustain at more than 10MB/s for pretty much the entire
duration of link.

My theory is my page cache was satisfying most of the I/O during the
regular build process. Those burst writes every few seconds were the
write buffer flushing to disk. During link, as the linker's RSS grew,
the page cache was evicted, causing more device I/O.

I have a theory that if I give my VM more memory (enough to hold the
source tree, .o files, and the linker's peak RSS), my device I/O will
effectively go to 0 (minus write buffer flush writes).

I would /really/ like to measure the actual I/O subsystem usage of GCC
vs Clang so I could prove your expectation that Clang reduces I/O
because of smaller .o files. Maybe I'll spin up a FreeBSD VM so I can
throw DTrace at it...

Gregory Szorc

Jan 27, 2012, 3:26:58 PM
to Rafael Ávila de Espíndola
On 1/27/2012 10:03 AM, Gregory Szorc wrote:
> I have a theory that if I give my VM more memory (enough to hold the
> source tree, .o files, and the linker's peak RSS), that my device I/O
> will effectively go to 0 (minus write buffer flush writes).

Proved. And it took a *lot* of memory. Peak memory usage was ~9GB. At
the time of linking, the page cache was ~5.5GB. By the end of the build,
it went up to ~5.6GB. This was with a clobber build against a fully
populated ccache (debug and optimized builds using GNU linker).

When I performed builds with an empty page cache, it was painfully
obvious that I/O was the bottleneck. Looking at system monitoring,
overall CPU wait % was greater than 80% for significant parts of the
build. Here are some raw numbers (coming from cgroups cpuacct and blkio
subsystems):

                 Empty page cache    Populated page cache
Wall time:       9:11                3:20
CPU time:        358s (5:58)         301s (5:01)
CPU user:        22s                 22s
CPU system:      140s                8s
Effective CPU:   14.75%              34.5%
Disk Time:       58s                 0.004s
I/O Wait Time:   2160s               0.15s
I/O Read Bytes:  2448MB              8192 bytes

After removing my objdir at the end of a build, my page cache goes from
~5.6GB to ~2.5GB, a difference of ~3.1GB.

So, conclusions.

If I were assembling a build machine and wanted consistently fast debug
build times, I would need enough memory to prevent page cache eviction.
Since peak memory usage is in the ~9GB range and you might have other
things running on your machine, I think you'd have to go with 12GB to be
on the safe side. If you can throw more in there, great.

Also, I/O matters. A lot. Without data in the page cache, my device read
~2.5GB during the build, spending 58s on the I/O scheduler. This comes
out to ~42MB/s. Not too shabby. However, queued I/O operations spent an
aggregate 36 minutes waiting for data from disk. This translated to a
5:51 wall clock difference in build times.
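A quick sanity check of that throughput figure, using the ~2.5GB read and the 58s of disk time reported by the cgroup:

```python
read_bytes = 2448 * 1024 * 1024  # ~2.5GB reported by blkio.io_service_bytes
disk_seconds = 58                # time on the I/O scheduler (blkio)

throughput_mb_s = read_bytes / disk_seconds / (1024 * 1024)  # ~42 MB/s
```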

Keep in mind this was all done with a fully populated ccache, so the
conditions were ideal for a fast build. If we remove the ccache and
force the build to become more CPU heavy, the differences likely won't
be as pronounced.

Greg

Mike Hommey

Jan 27, 2012, 3:49:28 PM
to Gregory Szorc, dev-b...@lists.mozilla.org, Rafael Ávila de Espíndola
On Fri, Jan 27, 2012 at 12:26:58PM -0800, Gregory Szorc wrote:
> Keep in mind this was all done with a fully populated ccache, so the
> conditions were ideal for a fast build. If we remove the ccache and
> force the build to become more CPU heavy, the differences likely won't
> be as pronounced.

Note that ccache is more I/O demanding than a normal build, because you
need to read the sources, the system headers, *and* the ccache, and
then you write the object file (and yes, I'm talking about the populated
cache case). If you want to spend less I/O with ccache, you need to
disable compression and enable hardlinks.

Mike

Gregory Szorc

Jan 27, 2012, 7:03:30 PM
On 1/27/2012 12:26 PM, Gregory Szorc wrote:
>                  Empty page cache    Populated page cache
> Wall time:       9:11                3:20
> CPU time:        358s (5:58)         301s (5:01)
> CPU user:        22s                 22s
> CPU system:      140s                8s
> Effective CPU:   14.75%              34.5%
> Disk Time:       58s                 0.004s
> I/O Wait Time:   2160s               0.15s
> I/O Read Bytes:  2448MB              8192 bytes

And comparable numbers when running on an SSD (slightly newer m-c revision):

                 Empty page cache    Populated page cache
Wall time:       4:50                3:46
CPU time:        291s (4:51)         289s (4:49)
CPU user:        223s                223s
CPU system:      8.2s                7.7s
Effective CPU:   23%                 29%
Disk Time:       24.4s               0.007s
I/O Wait Time:   411s                0.14s
I/O Read Bytes:  2421MB              16384 bytes

Also, my original values for CPU user time were off by a factor of 10.
Oops! And, my VM is experiencing weird multi-second pauses when running
on the SSD. I think this accounts for the increased wall time in the
populated ccache case.

I was unable to test with CCACHE_HARDLINK. I put "export
CCACHE_HARDLINK=1" in my .mozconfig, but it didn't seem to have any
effect. Pity. I think that would reduce the page cache size (and thus
overall memory usage) drastically since you would only need to store 1
instance of the binaries.

I wonder how the page cache works when you have a modern filesystem that
does copy-on-write, like ZFS. Is it smart enough to not clone the entry
until the copied file becomes different? Oh, the things I could explore.

Mike Hommey

Jan 28, 2012, 2:54:05 AM
to Gregory Szorc, dev-b...@lists.mozilla.org
CCACHE_HARDLINK has no effect if compression is enabled, and some Linux
distros enable it by default.

> I wonder how the page cache works when you have a modern filesystem that
> does copy-on-write, like ZFS. Is it smart enough to not clone the entry
> until the copied file becomes different?

Yes

Mike