[llvm-dev] LLD: time to enable --threads by default


Rui Ueyama via llvm-dev

Nov 16, 2016, 3:45:24 PM
to llvm-dev, Petr Hosek, Eugene Leviant, George Rimar
LLD supports multi-threading, and it seems to be working well, as you can see in a recent result. In short, LLD runs 30% faster with the --threads option and more than 50% faster if you are using --build-id (your mileage may vary depending on your computer). However, I suspect most users don't even know about it because --threads is not enabled by default.

I'm thinking of enabling --threads by default. We now have real users, and they'll be happy about the performance boost.

Any concerns?

I can't think of any problems with that, but I want to note a few things:

 - We still need to focus on single-thread performance rather than multi-threaded performance, because it is hard to make a slow program faster just by using more threads.

 - We shouldn't do "too clever" things with threads. Currently, we use multiple threads in only two places, which are highly parallelizable by nature (namely, copying and applying relocations for each input section, and computing the build-id hash). We use parallel_for_each, which is very simple and easy to understand (see the sketch after this list). I believe this was the right design choice, and I don't think we want something like the workqueues/tasks in GNU gold, for example.

 - Run benchmarks with --no-threads if you are not focusing on multi-thread performance.
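
To make that pattern concrete, here is a minimal, self-contained sketch of the kind of parallel copy/relocate loop described in the second note above. The type and helper names (Section, copyAndRelocate, writeSections) and the toy parallel_for_each are illustrative assumptions, not LLD's actual code; the real primitive lives in lld's parallel.h and the ELF writer.

// Minimal sketch only -- hypothetical names, not LLD's implementation.
// Each input section is copied into the output buffer and relocated
// independently, so a flat parallel loop over sections is enough.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

struct Section {
  const uint8_t *data;    // mmap'd input bytes
  size_t size;
  uint64_t outputOffset;  // where this section lands in the output image
  void copyAndRelocate(uint8_t *outBuf) const {
    std::copy(data, data + size, outBuf + outputOffset);
    // ...apply this section's relocations to outBuf + outputOffset here...
  }
};

// Toy parallel_for_each: split the range into one contiguous chunk per
// hardware thread and process the chunks concurrently.
template <class It, class Fn>
void parallel_for_each(It begin, It end, Fn fn) {
  size_t total = end - begin;
  unsigned n = std::max(1u, std::thread::hardware_concurrency());
  size_t chunk = (total + n - 1) / n;
  std::vector<std::thread> workers;
  for (size_t i = 0; i < total; i += chunk) {
    It first = begin + i, last = begin + std::min(total, i + chunk);
    workers.emplace_back([=] { std::for_each(first, last, fn); });
  }
  for (std::thread &t : workers)
    t.join();
}

void writeSections(std::vector<Section> &sections, uint8_t *outBuf) {
  parallel_for_each(sections.begin(), sections.end(),
                    [&](const Section &s) { s.copyAndRelocate(outBuf); });
}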

Rafael Espíndola via llvm-dev

Nov 16, 2016, 3:52:21 PM
to Rui Ueyama, Petr Hosek, llvm-dev, Eugene Leviant, George Rimar
I will do a quick benchmark run.

Other than the observations you have, my only concern is the situation
where many lld invocations run in parallel, like in an llvm build where
there are many outputs in bin/. Our task system doesn't know about
system load, so I worry that it might degrade performance in that case.

Cheers,
Rafael


Renato Golin via llvm-dev

Nov 16, 2016, 3:55:27 PM
to Rui Ueyama, llvm-dev, Eugene Leviant, George Rimar, Petr Hosek
On 16 November 2016 at 20:44, Rui Ueyama via llvm-dev <llvm...@lists.llvm.org> wrote:
> I'm thinking to enable --threads by default. We now have real users, and
> they'll be happy about the performance boost.

Will it detect single-core computers and disable it? What is the
minimum number of threads that can run in that mode?

Is the penalty on dual core computers less than the gains? If you
could have a VM with only two cores, where the OS is running on one
and LLD threads are running on both, it'd be good to measure the
downgrade.

Rafael's concern is also very real. I/O and memory consumption are
important factors on small footprint systems, though I'd be happy to
have a different default per architecture or even carry the burden of
forcing a --no-threads option every run if the benefits are
substantial.

If those issues are not a concern, then I'm in favour!


> - We still need to focus on single-thread performance rather than
> multi-threaded one because it is hard to make a slow program faster just by
> using more threads.

Agreed.


> - We shouldn't do "too clever" things with threads. Currently, we are using
> multi-threads only at two places where they are highly parallelizable by
> nature (namely, copying and applying relocations for each input section, and
> computing build-id hash). We are using parallel_for_each, and that is very
> simple and easy to understand. I believe this was a right design choice, and
> I don't think we want to have something like workqueues/tasks in GNU gold,
> for example.

Strongly agreed.

cheers,
--renato

Rui Ueyama via llvm-dev

Nov 16, 2016, 4:28:21 PM
to Renato Golin, llvm-dev, Eugene Leviant, George Rimar, Petr Hosek
On Wed, Nov 16, 2016 at 12:55 PM, Renato Golin <renato...@linaro.org> wrote:
On 16 November 2016 at 20:44, Rui Ueyama via llvm-dev
<llvm...@lists.llvm.org> wrote:
> I'm thinking to enable --threads by default. We now have real users, and
> they'll be happy about the performance boost.

Will it detect single-core computers and disable it? What is the
minimum number of threads that can run in that mode?

Is the penalty on dual core computers less than the gains? If you
could have a VM with only two cores, where the OS is running on one
and LLD threads are running on both, it'd be good to measure the
downgrade.

As a quick test, I ran the benchmark again with "taskset -c 0" to use only one core. LLD still spawns 40 threads because my machine has 40 virtual cores (20 physical cores), so all 40 threads ran on one core.

With --no-threads (one thread on a single core), it took 6.66 seconds to self-link. With --threads (40 threads on a single core), it took 6.70 seconds. I guess the difference is mostly within the error margin, so I think it wouldn't hurt single-core machines.

Rafael may be running his benchmarks now and will share his results.

Rui Ueyama via llvm-dev

Nov 16, 2016, 4:30:34 PM
to Renato Golin, llvm-dev, Eugene Leviant, George Rimar, Petr Hosek
On Wed, Nov 16, 2016 at 12:55 PM, Renato Golin <renato...@linaro.org> wrote:
On 16 November 2016 at 20:44, Rui Ueyama via llvm-dev
<llvm...@lists.llvm.org> wrote:
> I'm thinking to enable --threads by default. We now have real users, and
> they'll be happy about the performance boost.

Will it detect single-core computers and disable it? What is the
minimum number of threads that can run in that mode?

Is the penalty on dual core computers less than the gains? If you
could have a VM with only two cores, where the OS is running on one
and LLD threads are running on both, it'd be good to measure the
downgrade.

Rafael's concern is also very real. I/O and memory consumption are
important factors on small footprint systems, though I'd be happy to
have a different default per architecture or even carry the burden of
forcing a --no-threads option every run if the benefits are
substantial.

On such a computer, you don't want to enable threads at all, no? If so, you can build LLVM without LLVM_ENABLE_THREADS.

Renato Golin via llvm-dev

Nov 16, 2016, 4:44:02 PM
to Rui Ueyama, llvm-dev, Eugene Leviant, George Rimar, Petr Hosek
On 16 November 2016 at 21:29, Rui Ueyama <ru...@google.com> wrote:
> On such a computer, you don't want to enable threads at all, no? If so, you
> can build LLVM without LLVM_ENABLE_THREADS.

ARM hardware varies greatly; you don't want to restrict it that much.

Some boards have one core and 512MB of RAM, others have 8 cores and
1GB, others 8 cores and 32GB, others 96 cores, and so on.

You have mostly answered my questions, though.

Single-thread -> multi-thread on a single core is mostly noise, and the
number of threads is detected from the number of available cores. That
should be fine in most cases, even on old ARM.

Rafael Espíndola via llvm-dev

Nov 16, 2016, 4:55:29 PM
to Rui Ueyama, Petr Hosek, llvm-dev, Eugene Leviant, George Rimar
On 16 November 2016 at 15:52, Rafael Espíndola <rafael.e...@gmail.com> wrote:
> I will do a quick benchmark run.


On a Mac Pro (running Linux), these are the results I got with all cores available:

firefox
master 7.146418217
patch 5.304271767 1.34729488437x faster
firefox-gc
master 7.316743822
patch 5.46436812 1.33899174824x faster
chromium
master 4.265597914
patch 3.972218527 1.07385781648x faster
chromium fast
master 1.823614026
patch 1.686059427 1.08158348205x faster
the gold plugin
master 0.340167513
patch 0.318601465 1.06768973269x faster
clang
master 0.579914119
patch 0.520784947 1.11353855817x faster
llvm-as
master 0.03323043
patch 0.041571719 1.251013574x slower
the gold plugin fsds
master 0.36675887
patch 0.350970944 1.04498356992x faster
clang fsds
master 0.656180056
patch 0.591607603 1.10914743602x faster
llvm-as fsds
master 0.030324313
patch 0.040045353 1.32056917497x slower
scylla
master 3.23378908
patch 2.019191831 1.60152642773x faster

With only 2 cores:

firefox
master 7.174839911
patch 6.319808477 1.13529388384x faster
firefox-gc
master 7.345525844
patch 6.493005841 1.13129820362x faster
chromium
master 4.180752414
patch 4.129515199 1.01240756179x faster
chromium fast
master 1.847296843
patch 1.78837299 1.0329483018x faster
the gold plugin
master 0.341725451
patch 0.339943222 1.0052427255x faster
clang
master 0.581901114
patch 0.566932481 1.02640284955x faster
llvm-as
master 0.03381059
patch 0.036671392 1.08461260215x slower
the gold plugin fsds
master 0.369184003
patch 0.368774353 1.00111084189x faster
clang fsds
master 0.660120583
patch 0.641040511 1.02976422187x faster
llvm-as fsds
master 0.031074029
patch 0.035421531 1.13990789543x slower
scylla
master 3.243011681
patch 2.630991522 1.23261958615x faster


With only 1 core:

firefox
master 7.174323116
patch 7.301968002 1.01779190649x slower
firefox-gc
master 7.339104117
patch 7.466171668 1.01731376868x slower
chromium
master 4.176958448
patch 4.188387233 1.00273615003x slower
chromium fast
master 1.848922713
patch 1.858714219 1.00529578978x slower
the gold plugin
master 0.342383846
patch 0.347106743 1.01379415838x slower
clang
master 0.582476955
patch 0.600524655 1.03098440178x slower
llvm-as
master 0.033248459
patch 0.035622988 1.07141771593x slower
the gold plugin fsds
master 0.369510236
patch 0.376390506 1.01861997133x slower
clang fsds
master 0.661267753
patch 0.683417482 1.03349585535x slower
llvm-as fsds
master 0.030574688
patch 0.033052779 1.08105041006x slower
scylla
master 3.236604638
patch 3.325831407 1.02756801617x slower

Given that we have an improvement even with just two cores available, LGTM.

Cheers,
Rafael

Rui Ueyama via llvm-dev

Nov 16, 2016, 6:59:32 PM
to Rafael Espíndola, Petr Hosek, llvm-dev, Eugene Leviant, George Rimar
By the way, while running the benchmark, I found that our SHA1 function seems much slower than the one in gold. gold slowed down by only 1.3 seconds to compute a SHA1 of the output, but we spent 6.0 seconds to do the same thing (I believe). Something doesn't seem right.

Here is a table of times to link the same binary with --no-threads and --build-id={none,md5,sha1}. The numbers are in seconds.

       LLD   gold
none   7.82  13.78
MD5    9.68  14.56
SHA1  13.85  15.05

Renato Golin via llvm-dev

Nov 16, 2016, 7:01:47 PM
to Rafael Espíndola, llvm-dev, George Rimar, Eugene Leviant, Petr Hosek
On 16 November 2016 at 21:46, Rafael Espíndola via llvm-dev <llvm...@lists.llvm.org> wrote:
> Given that we have an improvement even with just two cores available, LGTM.

Thanks for the extensive benchmarking! :)

LGTM, too.

cheers,
--renato

Mehdi Amini via llvm-dev

Nov 16, 2016, 7:03:38 PM
to Rui Ueyama, Amaury SECHET, llvm-dev, Eugene Leviant, Petr Hosek, George Rimar
SHA1 in LLVM is *very* naive; any improvement is welcome there!
I think Amaury pointed it out originally, and he had an alternative implementation IIRC.

— 
Mehdi

Rui Ueyama via llvm-dev

Nov 16, 2016, 7:06:14 PM
to Mehdi Amini, Petr Hosek, llvm-dev, Eugene Leviant, Amaury SECHET, George Rimar
Can we just copy-and-paste optimized code from somewhere?

Mehdi Amini via llvm-dev

Nov 16, 2016, 7:11:05 PM
to Rui Ueyama, Petr Hosek, llvm-dev, Eugene Leviant, Amaury SECHET, George Rimar
The current implementation was “copy/pasted” from somewhere (it was explicitly public domain).

Rui Ueyama via llvm-dev

Nov 16, 2016, 7:14:33 PM
to Mehdi Amini, Petr Hosek, llvm-dev, Eugene Leviant, Amaury SECHET, George Rimar
I should have asked: do you know of an optimized SHA1 implementation that we can use?

Mehdi Amini via llvm-dev

Nov 16, 2016, 7:16:55 PM
to Rui Ueyama, Petr Hosek, llvm-dev, Eugene Leviant, Amaury SECHET, George Rimar
No, otherwise I'd have picked it instead of this one ;-)

The alternative plan was either to:
1) reach out to someone who has written an optimized one and convince them to contribute it to LLVM, or
2) “optimize” the one in tree.

But it hasn't been high enough on my priority list.

— 
Mehdi

Rui Ueyama via llvm-dev

Nov 16, 2016, 7:26:05 PM
to Mehdi Amini, Petr Hosek, llvm-dev, Eugene Leviant, Amaury SECHET, George Rimar
Have you considered using OpenSSL's implementation? (Just wondering; I don't know much about OpenSSL, but I guess their code is optimized.)

Joerg Sonnenberger via llvm-dev

Nov 16, 2016, 8:13:39 PM
to llvm...@lists.llvm.org
On Wed, Nov 16, 2016 at 04:05:38PM -0800, Rui Ueyama via llvm-dev wrote:
> Can we just copy-and-paste optimized code from somewhere?

The NetBSD version is also PD and uses much more aggressive loop
unrolling:

https://github.com/jsonn/src/blob/trunk/common/lib/libc/hash/sha1/sha1.c

It's still a bit slower than an optimised assembler version, but
typically good enough.

Joerg

Joerg Sonnenberger via llvm-dev

Nov 16, 2016, 8:15:06 PM
to llvm...@lists.llvm.org
On Wed, Nov 16, 2016 at 12:44:46PM -0800, Rui Ueyama via llvm-dev wrote:
> I'm thinking to enable --threads by default. We now have real users, and
> they'll be happy about the performance boost.
>
> Any concerns?

What is the total time consumed, not just the real time? When building
a large project, linking is often done in parallel with other tasks, so
wasting a lot of CPU to save a bit of real time is not necessarily a net
win.

Rui Ueyama via llvm-dev

Nov 16, 2016, 8:27:00 PM
to Joerg Sonnenberger, llvm-dev
Did you see this http://llvm.org/viewvc/llvm-project?view=revision&revision=287140 ? Interpreting these numbers may be tricky because of hyper threading, though.

Joerg Sonnenberger via llvm-dev

Nov 16, 2016, 9:14:04 PM
to llvm-dev
On Wed, Nov 16, 2016 at 05:26:23PM -0800, Rui Ueyama wrote:
> Did you see this
> http://llvm.org/viewvc/llvm-project?view=revision&revision=287140 ?
> Interpreting these numbers may be tricky because of hyper threading, though.

Can you try that with a CPU set that explicitly doesn't include the HT
cores? That's more likely to give a reasonable answer for "what is the
thread overhead".

Rui Ueyama via llvm-dev

Nov 16, 2016, 10:21:13 PM
to Joerg Sonnenberger, llvm-dev
Here is the result of running 20 threads on 20 physical cores (40 virtual cores).

      19002.081139 task-clock (msec)         #    2.147 CPUs utilized            ( +-  2.88% )
            23,006 context-switches          #    0.001 M/sec                    ( +-  2.24% )
             1,491 cpu-migrations            #    0.078 K/sec                    ( +- 22.50% )
         2,607,076 page-faults               #    0.137 M/sec                    ( +-  0.83% )
    56,818,049,785 cycles                    #    2.990 GHz                      ( +-  2.54% )
    41,072,435,357 stalled-cycles-frontend   #   72.29% frontend cycles idle     ( +-  3.36% )
   <not supported> stalled-cycles-backend  
    41,090,608,917 instructions              #    0.72  insns per cycle        
                                             #    1.00  stalled cycles per insn  ( +-  0.46% )
     7,621,825,115 branches                  #  401.105 M/sec                    ( +-  0.52% )
       139,383,452 branch-misses             #    1.83% of all branches          ( +-  0.18% )

       8.848611242 seconds time elapsed                                          ( +-  2.72% )

and this is the single-thread result.

      12738.416627 task-clock (msec)         #    1.000 CPUs utilized            ( +-  5.04% )
             1,283 context-switches          #    0.101 K/sec                    ( +-  5.49% )
                 3 cpu-migrations            #    0.000 K/sec                    ( +- 55.20% )
         2,614,435 page-faults               #    0.205 M/sec                    ( +-  2.52% )
    41,732,843,312 cycles                    #    3.276 GHz                      ( +-  5.76% )
    26,816,171,736 stalled-cycles-frontend   #   64.26% frontend cycles idle     ( +-  8.48% )
   <not supported> stalled-cycles-backend  
    39,776,444,917 instructions              #    0.95  insns per cycle        
                                             #    0.67  stalled cycles per insn  ( +-  0.84% )
     7,288,624,141 branches                  #  572.177 M/sec                    ( +-  1.02% )
       135,684,171 branch-misses             #    1.86% of all branches          ( +-  0.12% )

      12.734335840 seconds time elapsed                                          ( +-  5.03% )

Peter Smith via llvm-dev

Nov 17, 2016, 4:48:19 AM
to Rui Ueyama, llvm-dev
I've no objections to changing the default, as multi-threading can
always be turned off and most users stand to benefit.

One thing to consider is whether multi-threading would increase
memory usage. I'm most concerned about virtual address space, as this
can get eaten up very quickly on a 32-bit machine, particularly when
debug info is used. Given that the data set isn't increased by enabling
multiple threads, I speculate that the biggest risk would be different
threads mmapping overlapping parts of the files in a non-shared way.

It will be worth keeping track of how much memory is being used, as
people may need to alter their maximum number of parallel link jobs to
compensate. From prior experience, building clang with debug info on a
16 GB machine using -j8 will bring it to a halt.

Peter


Renato Golin via llvm-dev

Nov 17, 2016, 4:49:43 AM
to Rui Ueyama, llvm-dev
On 17 November 2016 at 03:20, Rui Ueyama via llvm-dev
<llvm...@lists.llvm.org> wrote:
> Here is the result of running 20 threads on 20 physical cores (40 virtual
> cores).
>
> 19002.081139 task-clock (msec) # 2.147 CPUs utilized
> 12738.416627 task-clock (msec) # 1.000 CPUs utilized

Sounds like threading isn't beneficial much beyond the second CPU...
Maybe blindly creating one thread per core isn't the best plan...

--renato

Rafael Espíndola via llvm-dev

Nov 17, 2016, 7:11:33 AM
to Renato Golin, llvm-dev
> Sounds like threading isn't beneficial much beyond the second CPU...
> Maybe blindly creating one thread per core isn't the best plan...

parallel.h is pretty simplistic at the moment. Currently it creates
one thread per SMT. Creating one per core and being lazy about it would
probably be a good thing, but threading is already beneficial, and
improving parallel.h would be a welcome improvement.

Cheers,
Rafael

Teresa Johnson via llvm-dev

Nov 17, 2016, 9:13:02 AM
to Rafael Espíndola, llvm-dev
On Thu, Nov 17, 2016 at 4:11 AM, Rafael Espíndola via llvm-dev <llvm...@lists.llvm.org> wrote:
> Sounds like threading isn't beneficial much beyond the second CPU...
> Maybe blindly creating one thread per core isn't the best plan...

parallel.h is pretty simplistic at the moment. Currently it creates
one thread per SMT. Creating one per core and being lazy about it would
probably be a good thing, but threading is already beneficial, and
improving parallel.h would be a welcome improvement.

Instead of using std::thread::hardware_concurrency (which is one per SMT), you may be interested in using the facility I added for setting default ThinLTO backend parallelism so that one per physical core is created, llvm::heavyweight_hardware_concurrency() (see D25585  and r284390). The name is meant to indicate that this is the concurrency that should be used for heavier weight tasks (that may use a lot of memory e.g.).
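
To make that concrete, here is a minimal sketch of how a thread-count default could use this facility. It assumes the interface added in r284390 returns an unsigned thread count (roughly one per physical core on x86, falling back to std::thread::hardware_concurrency() elsewhere, as described above); this is an illustration, not lld's actual code.

// Hypothetical sketch: pick a default worker count depending on how
// memory-hungry the parallel tasks are. Assumes, as described above, that
// llvm::heavyweight_hardware_concurrency() returns an unsigned thread count.
#include "llvm/Support/Threading.h"

#include <algorithm>
#include <thread>

unsigned defaultWorkerThreads(bool heavyweightTasks) {
  if (heavyweightTasks)
    return llvm::heavyweight_hardware_concurrency(); // ~one per physical core
  // Lightweight tasks can reasonably use every SMT thread.
  return std::max(1u, std::thread::hardware_concurrency());
}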

Teresa


Cheers,
Rafael



--
Teresa Johnson | Software Engineer | tejo...@google.com | 408-460-2413

Rui Ueyama via llvm-dev

Nov 17, 2016, 12:10:56 PM
to Renato Golin, llvm-dev
On Thu, Nov 17, 2016 at 1:49 AM, Renato Golin <renato...@linaro.org> wrote:
On 17 November 2016 at 03:20, Rui Ueyama via llvm-dev
<llvm...@lists.llvm.org> wrote:
> Here is the result of running 20 threads on 20 physical cores (40 virtual
> cores).
>
>       19002.081139 task-clock (msec)         #    2.147 CPUs utilized
>       12738.416627 task-clock (msec)         #    1.000 CPUs utilized

Sounds like threading isn't beneficial much beyond the second CPU...
Maybe blindly creating one thread per core isn't the best plan...

That's an average; at its peak it's using more than two cores.

Rui Ueyama via llvm-dev

Nov 17, 2016, 12:19:57 PM
to Renato Golin, llvm-dev
I submitted r287237 to make --threads the default. You can disable it with --no-threads. Thanks!

Rui Ueyama via llvm-dev

Nov 17, 2016, 12:42:12 PM
to Teresa Johnson, llvm-dev
On Thu, Nov 17, 2016 at 6:12 AM, Teresa Johnson via llvm-dev <llvm...@lists.llvm.org> wrote:


On Thu, Nov 17, 2016 at 4:11 AM, Rafael Espíndola via llvm-dev <llvm...@lists.llvm.org> wrote:
> Sounds like threading isn't beneficial much beyond the second CPU...
> Maybe blindly creating one thread per core isn't the best plan...

parallel.h is pretty simplistic at the moment. Currently it creates
one thread per SMT. Creating one per core and being lazy about it would
probably be a good thing, but threading is already beneficial, and
improving parallel.h would be a welcome improvement.

Instead of using std::thread::hardware_concurrency (which is one per SMT), you may be interested in using the facility I added for setting default ThinLTO backend parallelism so that one per physical core is created, llvm::heavyweight_hardware_concurrency() (see D25585  and r284390). The name is meant to indicate that this is the concurrency that should be used for heavier weight tasks (that may use a lot of memory e.g.).

Sorry for my ignorance, but what's the point of running the same number of threads as the number of physical cores instead of HT virtual cores? If we can get better throughput by not running more than one thread per physical core, it feels like HT is a useless technology.

Mehdi Amini via llvm-dev

Nov 17, 2016, 12:50:47 PM
to Rui Ueyama, llvm-dev
On Nov 17, 2016, at 9:41 AM, Rui Ueyama via llvm-dev <llvm...@lists.llvm.org> wrote:

On Thu, Nov 17, 2016 at 6:12 AM, Teresa Johnson via llvm-dev <llvm...@lists.llvm.org> wrote:


On Thu, Nov 17, 2016 at 4:11 AM, Rafael Espíndola via llvm-dev <llvm...@lists.llvm.org> wrote:
> Sounds like threading isn't beneficial much beyond the second CPU...
> Maybe blindly creating one thread per core isn't the best plan...

parallel.h is pretty simplistic at the moment. Currently it creates
one thread per SMT. Creating one per core and being lazy about it would
probably be a good thing, but threading is already beneficial, and
improving parallel.h would be a welcome improvement.

Instead of using std::thread::hardware_concurrency (which is one per SMT), you may be interested in using the facility I added for setting default ThinLTO backend parallelism so that one per physical core is created, llvm::heavyweight_hardware_concurrency() (see D25585  and r284390). The name is meant to indicate that this is the concurrency that should be used for heavier weight tasks (that may use a lot of memory e.g.).

Sorry for my ignorance, but what's the point of running the same number of threads as the number of physical cores instead of HT virtual cores? If we can get better throughput by not running more than one thread per a physical core, it feels like HT is a useless technology.

It depends on the use-case: with ThinLTO we scale linearly with the number of physical cores. When you get over the number of physical cores you still get some improvements, but that’s no longer linear.
The profitability question is a tradeoff: for example, if each of your tasks is very memory intensive, you may not want to overcommit the cores or the available memory per physical core.

To take some numbers as an example: if your average user has an 8GB machine with 4 cores (8 virtual cores with HT), and you know that each of your parallel tasks consumes 1.5GB of memory on average, then having 4 parallel worker threads to process your tasks will lead to a peak memory of 6GB, while having 8 parallel threads will lead to a peak memory of 12GB, and the machine will start to swap.

Another consideration is that having the linker spawn threads behind the back of the build system isn't great: the build system is supposed to exploit the parallelism. Now if it spawns 10 linker jobs in parallel, how many threads are competing for the hardware?

So, HT is not useless, but it is not universally applicable or universally efficient in the same way.

Hope it makes sense!

— 
Mehdi


Rui Ueyama via llvm-dev

Nov 17, 2016, 1:01:23 PM
to Mehdi Amini, llvm-dev
Thank you for the explanation! That makes sense.

Unlike ThinLTO, each thread in LLD consumes a very small amount of memory (probably just a few megabytes), so that's not a problem for me. At the final stage of linking, we spawn threads to copy section contents and apply relocations, and I guess that causes a lot of memory traffic because it's basically memcpy'ing input files into an output file, so memory bandwidth could be a limiting factor there. But I do not see a reason to limit the number of threads to the number of physical cores. For LLD, it seems like we can just spawn as many threads as HT provides.

Teresa Johnson via llvm-dev

Nov 17, 2016, 1:08:15 PM
to Rui Ueyama, llvm-dev
Ok, sure - I was just suggesting it based on Rafael's comment above about lld currently creating one thread per SMT and possibly wanting one per core instead. It will definitely depend on the characteristics of your parallel tasks (which is why the name of the interface was changed to include "heavyweight", i.e. large, memory-intensive tasks, since the implementation may return something other than the number of physical cores for other architectures - right now it is only implemented for x86 and otherwise returns thread::hardware_concurrency()).

Teresa

Renato Golin via llvm-dev

Nov 17, 2016, 1:27:22 PM
to Mehdi Amini, llvm-dev
On 17 November 2016 at 17:50, Mehdi Amini via llvm-dev <llvm...@lists.llvm.org> wrote:
> It depends on the use-case: with ThinLTO we scale linearly with the number
> of physical cores. When you get over the number of physical cores you still
> get some improvements, but that’s no longer linear.

Indeed, with HT, cores have two execution units on the same cache/bus
line, so memory access is likely to be contended. Linkers are memory
hungry, which adds to the I/O bottleneck and makes most of the gain
disappear. :)

Furthermore, the FP unit is also shared among the ALUs, so
FP-intensive code does not make good use of HT. Not the case here,
though.

cheers,
-renato

Rui Ueyama via llvm-dev

Nov 17, 2016, 1:32:22 PM
to Renato Golin, llvm-dev
On Thu, Nov 17, 2016 at 10:27 AM, Renato Golin <renato...@linaro.org> wrote:
On 17 November 2016 at 17:50, Mehdi Amini via llvm-dev
<llvm...@lists.llvm.org> wrote:
> It depends on the use-case: with ThinLTO we scale linearly with the number
> of physical cores. When you get over the number of physical cores you still
> get some improvements, but that’s no longer linear.

Indeed, in HT, cores have two execution units on the same cache/bus
line, so memory access is likely to be contrived. Linkers are memory
hungry, which add to the I/O bottleneck which makes most of the gain
disappear. :)

Furthermore, the FP unit is also shared among the ALUs, so
FP-intensive code does not make good use of HT. Not the case, here,
though.

We literally have no variables of type float or double in LLD. It would work fine on a 486SX. :)

Rafael Espíndola via llvm-dev

Nov 17, 2016, 4:20:32 PM
to Rui Ueyama, llvm-dev
>
> Thank you for the explanation! That makes sense.
>
> Unlike ThinLTO, each thread in LLD consumes very small amount of memory
> (probably just a few megabytes), so that's not a problem for me. At the
> final stage of linking, we spawn threads to copy section contents and apply
> relocations, and I guess that causes a lot of memory traffic because that's
> basically memcpy'ing input files to an output file, so the memory bandwidth
> could be a limiting factor there. But I do not see a reason to limit the
> number of threads to the number of physical core. For LLD, it seems like we
> can just spawn as many threads as HT provides.


It is quite common for SMT to *not* be profitable. I did notice some
small wins by not using it. On an Intel machine you can do a quick
check by running with half the threads, since they always have 2x SMT.

Davide Italiano via llvm-dev

Nov 17, 2016, 9:31:32 PM
to Rafael Espíndola, llvm-dev
On Thu, Nov 17, 2016 at 1:20 PM, Rafael Espíndola via llvm-dev
<llvm...@lists.llvm.org> wrote:
>>
>> Thank you for the explanation! That makes sense.
>>
>> Unlike ThinLTO, each thread in LLD consumes very small amount of memory
>> (probably just a few megabytes), so that's not a problem for me. At the
>> final stage of linking, we spawn threads to copy section contents and apply
>> relocations, and I guess that causes a lot of memory traffic because that's
>> basically memcpy'ing input files to an output file, so the memory bandwidth
>> could be a limiting factor there. But I do not see a reason to limit the
>> number of threads to the number of physical core. For LLD, it seems like we
>> can just spawn as many threads as HT provides.
>
>
> It is quite common for SMT to *not* be profitable. I did notice some
> small wins by not using it. On an intel machine you can do a quick
> check by running with half the threads since they always have 2x SMT.
>

I had the same experience. Ideally I would like to have a way to
override the number of threads used by the linker.
gold has a plethora of options for doing that, e.g.:

--thread-count COUNT          Number of threads to use
--thread-count-initial COUNT  Number of threads to use in initial pass
--thread-count-middle COUNT   Number of threads to use in middle pass
--thread-count-final COUNT    Number of threads to use in final pass

I don't think we need the full generality/flexibility of
initial/middle/final, but --thread-count could be useful (at least for
experimenting). The current interface of `parallel_for_each` doesn't
allow specifying the number of threads to run, so, assuming lld goes
that route (it may not), it should be extended accordingly.
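
As a rough illustration of how an explicit count could be resolved before handing it to the parallel runtime, here is a small sketch; the function name and option handling are assumptions, not an actual lld patch.

// Hypothetical sketch of resolving a --thread-count value: 0 (the default)
// means "auto-detect", anything else is used as requested.
// Not lld's actual driver code.
#include <algorithm>
#include <thread>

unsigned resolveThreadCount(unsigned requested /* value of --thread-count */) {
  unsigned hw = std::max(1u, std::thread::hardware_concurrency());
  if (requested == 0)
    return hw;       // auto: one thread per hardware thread, today's behavior
  return requested;  // honor an explicit request, even if it oversubscribes
}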

--
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare

Rui Ueyama via llvm-dev

Nov 17, 2016, 10:34:39 PM
to Davide Italiano, llvm-dev
I agree that these options would be useful for testing, but I'm reluctant to expose them as user options because I wish LLD would just work out of the box without turning lots of knobs.

Davide Italiano via llvm-dev

Nov 17, 2016, 11:05:42 PM
to Rui Ueyama, llvm-dev

I share your view that lld should work fine out of the box. An
alternative might be to make the option hidden. The set of users
tinkering with linker options is not large, although there are some
people who like to override/"tune" the linker anyway, so IMHO we
should expose a sane default and let users decide whether they care
(a similar example is what we do for --thinlto-threads or
--lto-partitions, even though in the latter case we still have it set
to 1 because it's not entirely clear what a reasonable number is).

Davide Italiano via llvm-dev

Nov 17, 2016, 11:10:30 PM
to Rui Ueyama, llvm-dev

I've seen a case where the linker was pinned to a specific subset of the CPUs
and many linker invocations were launched in parallel.
(Actually, this is the only time I've seen gold's --threads used.)
I personally don't expect this to be the common use case, but it's not hard
to imagine complex build systems adopting a similar strategy.

Rui Ueyama via llvm-dev

Nov 18, 2016, 12:25:54 PM
to Davide Italiano, llvm-dev
Sure. If you want to add --thread-count (but not other options, such as --thread-count-initial), that's fine with me.

Sean Silva via llvm-dev

Nov 23, 2016, 2:41:11 AM
to Rui Ueyama, llvm-dev, Eugene Leviant, George Rimar, Petr Hosek


On Wed, Nov 16, 2016 at 12:44 PM, Rui Ueyama via llvm-dev <llvm...@lists.llvm.org> wrote:
LLD supports multi-threading, and it seems to be working well, as you can see in a recent result. In short, LLD runs 30% faster with the --threads option and more than 50% faster if you are using --build-id (your mileage may vary depending on your computer). However, I suspect most users don't even know about it because --threads is not enabled by default.

I'm thinking of enabling --threads by default. We now have real users, and they'll be happy about the performance boost.

Any concerns?

I can't think of any problems with that, but I want to note a few things:

 - We still need to focus on single-thread performance rather than multi-threaded performance, because it is hard to make a slow program faster just by using more threads.

 - We shouldn't do "too clever" things with threads. Currently, we use multiple threads in only two places, which are highly parallelizable by nature (namely, copying and applying relocations for each input section, and computing the build-id hash). We use parallel_for_each, which is very simple and easy to understand. I believe this was the right design choice, and I don't think we want something like the workqueues/tasks in GNU gold, for example.

Sorry for the late response.

Copying and applying relocations is actually not as parallelizable as you might imagine in current LLD. The reason is that there is implicit serialization when mutating the kernel's VA map (which happens any time there is a minor page fault, i.e. the first time you touch a page of an mmap'd input). Since threads share the same VA, there is an implicit serialization across them. Separate processes are needed to avoid this overhead (note that the separate processes would still have the same output file mapped, so, at least with fixed partitioning, there is no need for complex IPC).

For `ld.lld -O0` on a Mac host, I measured <1GB/s copying speed, even though the machine I was running on had something like 50 GB/s of DRAM bandwidth; the VA overhead is on the order of a 50x slowdown for this copying operation in this extreme case, so Amdahl's law indicates that there will be practically no speedup for this copy operation from adding multiple threads. I've also DTrace'd this and seen massive contention on the VA lock. Linux will be better, but no matter how good it is, it is still a serialization point, and Amdahl's law will limit your speedup significantly.

-- Sean Silva
 

 - Run benchmarks with --no-threads if you are not focusing on multi-thread performance.

Rafael Espíndola via llvm-dev

Nov 23, 2016, 9:32:00 AM
to Sean Silva, llvm-dev, George Rimar, Eugene Leviant, Petr Hosek
Interesting. It might be worth trying again the idea of creating the
file in anonymous memory and using a write to output it.

Cheers,
Rafael


Sean Silva via llvm-dev

Nov 23, 2016, 7:53:25 PM
to Rafael Espíndola, llvm-dev, George Rimar, Eugene Leviant, Petr Hosek
On Wed, Nov 23, 2016 at 6:31 AM, Rafael Espíndola <rafael.e...@gmail.com> wrote:
Interesting. Might be worth giving a try again to the idea of creating
the file in anonymous memory and using a write to output it.

I'm not sure that will help. Even the kernel can't escape some of these costs; in modern 64-bit operating systems, when you do a syscall you don't actually change the mappings (a TLB flush would be expensive), so the cost of populating the page tables in order to read the pages is still there (and hence the serialization point remains). One alternative is to use multiple processes instead of multiple threads, which would remove the serialization point by definition (it also seems like it might be a less invasive change, at least for the copying+relocating step).

One experiment might be to add a hack to pre-fault all the files that are used, so that you can isolate that cost from the rest of the link. That will give you an upper bound on the speedup that there is to get from optimizing this.
Pre-faulting the allocations removes the serialization bottleneck on the kernel VA, since after the page tables are fully populated, they become a read-only data structure and each core's hardware TLB walker can read it independently.

For example, you could change elf::link to optionally take a map from file paths to buffers, which will override the native filesystem. Then in main() (before calling elf::link) you can map and pre-fault all the input files (which can be found from a trace of a previous run or whatever). By timing the subsequent call to elf::link you can get the desired measurement.
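
A minimal sketch of the kind of pre-faulting hack described here (POSIX-only; the helper name is an assumption, not an existing LLVM or lld API): touch one byte per page of an already-mmap'd buffer so that the minor faults happen before the timed region.

// Hypothetical pre-faulting helper (POSIX). Reading one byte per page forces
// the kernel to populate the mapping up front, outside the timed link.
#include <sys/mman.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>

void prefault(const void *addr, size_t len) {
  // Best-effort hint that the whole range will be needed soon.
  madvise(const_cast<void *>(addr), len, MADV_WILLNEED);
  size_t pageSize = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  const volatile uint8_t *p = static_cast<const uint8_t *>(addr);
  uint8_t sink = 0;
  for (size_t off = 0; off < len; off += pageSize)
    sink ^= p[off];  // one read per page is enough to fault it in
  (void)sink;
}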

The ability to pass in a map of buffers would allow other experiments that would be interesting. For example, the experiment above could be repeated with all the input buffers copied into a handful of 1GB pages. This would allow entirely eliminating the hardware TLB walking overhead for input buffers.

-- Sean Silva

Rui Ueyama via llvm-dev

Nov 23, 2016, 8:01:04 PM
to Sean Silva, llvm-dev, Eugene Leviant, Petr Hosek, George Rimar
On Wed, Nov 23, 2016 at 4:53 PM, Sean Silva <chiso...@gmail.com> wrote:


On Wed, Nov 23, 2016 at 6:31 AM, Rafael Espíndola <rafael.e...@gmail.com> wrote:
Interesting. Might be worth giving a try again to the idea of creating
the file in anonymous memory and using a write to output it.

I'm not sure that will help. Even the kernel can't escape some of these costs; in modern 64-bit operating systems when you do a syscall you don't actually change the mappings (TLB flush would be expensive), so the cost of populating the page tables in order to read the pages is still there (and hence the serialization point remains). One alternative is to use multiple processes instead of multiple threads which would remove serialization point by definition (it also seems like it might be less invasive of a change, at least for the copying+relocating step).

One experiment might be to add a hack to pre-fault all the files that are used, so that you can isolate that cost from the rest of the link. That will give you an upper bound on the speedup that there is to get from optimizing this.

I experimented with adding MAP_POPULATE to LLVM's mmap in the hope that it would do what you described, but it made LLD 10% slower, and I cannot explain why.

Sean Silva via llvm-dev

Nov 23, 2016, 8:29:58 PM
to Rui Ueyama, llvm-dev, Eugene Leviant, Petr Hosek, George Rimar
On Wed, Nov 23, 2016 at 5:00 PM, Rui Ueyama <ru...@google.com> wrote:
On Wed, Nov 23, 2016 at 4:53 PM, Sean Silva <chiso...@gmail.com> wrote:


On Wed, Nov 23, 2016 at 6:31 AM, Rafael Espíndola <rafael.e...@gmail.com> wrote:
Interesting. Might be worth giving a try again to the idea of creating
the file in anonymous memory and using a write to output it.

I'm not sure that will help. Even the kernel can't escape some of these costs; in modern 64-bit operating systems when you do a syscall you don't actually change the mappings (TLB flush would be expensive), so the cost of populating the page tables in order to read the pages is still there (and hence the serialization point remains). One alternative is to use multiple processes instead of multiple threads which would remove serialization point by definition (it also seems like it might be less invasive of a change, at least for the copying+relocating step).

One experiment might be to add a hack to pre-fault all the files that are used, so that you can isolate that cost from the rest of the link. That will give you an upper bound on the speedup that there is to get from optimizing this.

I experimented to add MAP_POPULATE to LLVM's mmap in the hope that it would do what you, but it made LLD 10% slower, and I cannot explain why.

It may be that LLD does not touch every page of the input files, so MAP_POPULATE is causing the kernel to do unnecessary work faulting in things that LLD will never look at.

The purpose of the experiment I described, however, is not for those changes to make LLD faster per se, but to allow measuring how much faster LLD could be by optimizing this; with something like ftrace or dtrace you could measure the kernel time spent in MAP_POPULATE and subtract it to get a similar measurement.

-- Sean Silva

Sean Silva via llvm-dev

Nov 23, 2016, 10:13:21 PM
to Rafael Espíndola, llvm-dev, George Rimar, Eugene Leviant, Petr Hosek
On Wed, Nov 23, 2016 at 4:53 PM, Sean Silva <chiso...@gmail.com> wrote:


On Wed, Nov 23, 2016 at 6:31 AM, Rafael Espíndola <rafael.e...@gmail.com> wrote:
Interesting. Might be worth giving a try again to the idea of creating
the file in anonymous memory and using a write to output it.

I'm not sure that will help. Even the kernel can't escape some of these costs; in modern 64-bit operating systems when you do a syscall you don't actually change the mappings (TLB flush would be expensive), so the cost of populating the page tables in order to read the pages is still there (and hence the serialization point remains).

I looked a bit more closely at this. In current Linux, the actual faulting doesn't seem like it will have contention unless it needs to allocate memory for the page table structures (but even then, Linux's memory allocation is hopefully pretty optimized and scalable). The serialization point is when the kernel VM data structures need to be updated (NOT the hardware page tables, which are usually populated lazily), which generally will only happen inside the mmap call.

This should mean that the primary serialization points in the kernel will be:
1. during the mmap call on any file
2. depending on where the kernel maps each input file, during the first accesses to a file to allocate page table structures (but that should be pretty scalable)
3. allocating the output files (this should be pretty scalable too since it is just memory allocation)

A simple test on my machine at home (i7-6700HQ CPU @ 2.60GHz) shows about 90ms to fill a 1GB buffer when it is freshly allocated (i.e., while the buffer is unfaulted and the kernel has to fault the memory in during the memset) and about 45ms on subsequent runs (when the buffer is already faulted). Linux is using transparent huge pages, though, so that is with 2M pages. With transparent hugepages turned off so that 4k pages are used, the initial memset time goes up to about 280ms and subsequent times are still about 45ms. So with 4k pages, the slowdown is more than 6x for the initial faulting. Adding threads helps a bit, but cannot hide the fundamental work that the kernel has to do to allocate the memory.

4kpages-1thread-unfaulted 280ms
4kpages-2thread-unfaulted 160ms
4kpages-3thread-unfaulted 110ms
4kpages-4thread-unfaulted 100ms

2Mpages-1thread-unfaulted 90ms
2Mpages-2thread-unfaulted 60ms
2Mpages-3thread-unfaulted 65ms
2Mpages-4thread-unfaulted 70ms

4kpages-1thread-faulted 45ms
2Mpages-1thread-faulted 45ms

(note: these are pretty rough numbers; they are just to get a feel for the scaling)
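
For reference, here is a minimal sketch of the kind of microbenchmark described above (a single-threaded variant, an assumption about the setup rather than the exact program used; Linux/POSIX only): time the first memset of a fresh anonymous mapping, which includes the page-fault cost, then a second memset of the now-populated buffer.

// Hypothetical microbenchmark: the first memset of a fresh anonymous mapping
// includes page-fault cost; the second memset runs on already-populated pages.
#include <sys/mman.h>

#include <chrono>
#include <cstdio>
#include <cstring>

static double timedMemsetMs(void *buf, size_t len) {
  auto start = std::chrono::steady_clock::now();
  std::memset(buf, 0xAB, len);
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count();
}

int main() {
  const size_t len = size_t(1) << 30;  // 1 GiB
  void *buf = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (buf == MAP_FAILED)
    return 1;
  std::printf("unfaulted: %6.1f ms\n", timedMemsetMs(buf, len));
  std::printf("faulted:   %6.1f ms\n", timedMemsetMs(buf, len));
  munmap(buf, len);
  return 0;
}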

Also, adding threads for this test case *after the pages are faulted in* only helps a little bit since it is mostly bottlenecked on DRAM bandwidth (it is just a simple memset test though, so that is expected).

This was on Linux. The OS overhead will probably be larger (potentially much larger) for Mac and Windows.

-- Sean Silva