Other than the observations you have, my only concern is the situation
where many lld invocations run in parallel, like in an llvm build where
there are many outputs in bin/. Our task system doesn't know about load,
so I worry that it might degrade performance in that case.
Cheers,
Rafael
Will it detect single-core computers and disable it? What is the
minimum number of threads that can run in that mode?
Is the penalty on dual-core computers less than the gains? If you
could have a VM with only two cores, where the OS is running on one
and LLD threads are running on both, it'd be good to measure the
degradation.
Rafael's concern is also very real. I/O and memory consumption are
important factors on small-footprint systems, though I'd be happy to
have a different default per architecture, or even carry the burden of
forcing a --no-threads option on every run, if the benefits are
substantial.
If those issues are not a concern, then I'm in favour!
> - We still need to focus on single-thread performance rather than
> multi-threaded one because it is hard to make a slow program faster just by
> using more threads.
Agreed.
> - We shouldn't do "too clever" things with threads. Currently, we are using
> multi-threads only at two places where they are highly parallelizable by
> nature (namely, copying and applying relocations for each input section, and
> computing build-id hash). We are using parallel_for_each, and that is very
> simple and easy to understand. I believe this was a right design choice, and
> I don't think we want to have something like workqueues/tasks in GNU gold,
> for example.
Strongly agreed.
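For anyone following along, the pattern being described is roughly the
following. This is a minimal sketch, not lld's actual code: InputSection and
writeTo are illustrative names, and the header path/namespace for
parallel_for_each are from memory.

  // Each input section is copied into the output buffer and relocated
  // independently, so a flat parallel loop over the sections is all that is
  // needed -- no work queues or task graphs.
  #include <cstdint>
  #include <vector>
  #include "lld/Core/Parallel.h" // lld's parallel.h, as discussed in this thread

  struct InputSection {
    void writeTo(uint8_t *Buf) { /* memcpy contents, then apply relocations */ }
  };

  void writeSections(std::vector<InputSection *> &Sections, uint8_t *Buf) {
    lld::parallel_for_each(Sections.begin(), Sections.end(),
                           [=](InputSection *S) { S->writeTo(Buf); });
  }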
cheers,
--renato
ARM hardware varies greatly; you don't want to restrict that much.
Some boards have one core and 512MB of RAM, others have 8 cores and
1GB, others 8 cores and 32GB, others 96 cores, and so on.
You have mostly answered my questions, though.
Single -> multi thread on a single core is mostly noise, and the number
of threads is detected from the number of available cores. That
should be fine in most cases, even on old ARM.
On a Mac Pro (running Linux), the results I got with all cores available:
firefox
master 7.146418217
patch 5.304271767 1.34729488437x faster
firefox-gc
master 7.316743822
patch 5.46436812 1.33899174824x faster
chromium
master 4.265597914
patch 3.972218527 1.07385781648x faster
chromium fast
master 1.823614026
patch 1.686059427 1.08158348205x faster
the gold plugin
master 0.340167513
patch 0.318601465 1.06768973269x faster
clang
master 0.579914119
patch 0.520784947 1.11353855817x faster
llvm-as
master 0.03323043
patch 0.041571719 1.251013574x slower
the gold plugin fsds
master 0.36675887
patch 0.350970944 1.04498356992x faster
clang fsds
master 0.656180056
patch 0.591607603 1.10914743602x faster
llvm-as fsds
master 0.030324313
patch 0.040045353 1.32056917497x slower
scylla
master 3.23378908
patch 2.019191831 1.60152642773x faster
With only 2 cores:
firefox
master 7.174839911
patch 6.319808477 1.13529388384x faster
firefox-gc
master 7.345525844
patch 6.493005841 1.13129820362x faster
chromium
master 4.180752414
patch 4.129515199 1.01240756179x faster
chromium fast
master 1.847296843
patch 1.78837299 1.0329483018x faster
the gold plugin
master 0.341725451
patch 0.339943222 1.0052427255x faster
clang
master 0.581901114
patch 0.566932481 1.02640284955x faster
llvm-as
master 0.03381059
patch 0.036671392 1.08461260215x slower
the gold plugin fsds
master 0.369184003
patch 0.368774353 1.00111084189x faster
clang fsds
master 0.660120583
patch 0.641040511 1.02976422187x faster
llvm-as fsds
master 0.031074029
patch 0.035421531 1.13990789543x slower
scylla
master 3.243011681
patch 2.630991522 1.23261958615x faster
With only 1 core:
firefox
master 7.174323116
patch 7.301968002 1.01779190649x slower
firefox-gc
master 7.339104117
patch 7.466171668 1.01731376868x slower
chromium
master 4.176958448
patch 4.188387233 1.00273615003x slower
chromium fast
master 1.848922713
patch 1.858714219 1.00529578978x slower
the gold plugin
master 0.342383846
patch 0.347106743 1.01379415838x slower
clang
master 0.582476955
patch 0.600524655 1.03098440178x slower
llvm-as
master 0.033248459
patch 0.035622988 1.07141771593x slower
the gold plugin fsds
master 0.369510236
patch 0.376390506 1.01861997133x slower
clang fsds
master 0.661267753
patch 0.683417482 1.03349585535x slower
llvm-as fsds
master 0.030574688
patch 0.033052779 1.08105041006x slower
scylla
master 3.236604638
patch 3.325831407 1.02756801617x slower
Given that we have an improvement even with just two cores available, LGTM.
Cheers,
Rafael
Thanks for the extensive benchmarking! :)
LGTM, too.
cheers,
--renato
The NetBSD version is also PD and uses much more aggressive loop
unrolling:
https://github.com/jsonn/src/blob/trunk/common/lib/libc/hash/sha1/sha1.c
It's still a bit slower than an optimised assembler version, but
typically good enough.
Joerg
What is the total time consumed, not just the real time? When building
a large project, linking is often done in parallel with other tasks, so
wasting a lot of CPU to save a bit of real time is not necessarily a net
win.
Can you try that with a CPU set that explicitly doesn't include the HT
cores? That's more likely to give a reasonable answer for "what is the
thread overhead".
One possible thing to consider is whether multi-threading would increase
memory usage. I'm most concerned about virtual address space, as this
can get eaten up very quickly on a 32-bit machine, particularly when
debug info is used. Given that the data set isn't increased when enabling
multiple threads, I speculate that the biggest risk would be different
threads mmapping overlapping parts of the files in a non-shared way.
It will be worth keeping track of how much memory is being used, as
people may need to alter their maximum number of parallel link jobs to
compensate. From prior experience, building clang with debug info on a
16 GB machine using -j8 will bring it to a halt.
Peter
On 17 November 2016 at 03:20, Rui Ueyama via llvm-dev
<llvm...@lists.llvm.org> wrote:
> Here is the result of running 20 threads on 20 physical cores (40 virtual
> cores).
>
> 19002.081139 task-clock (msec) # 2.147 CPUs utilized
> 12738.416627 task-clock (msec) # 1.000 CPUs utilized
Sounds like threading isn't beneficial much beyond the second CPU...
Maybe blindly creating one thread per core isn't the best plan...
On Thu, Nov 17, 2016 at 4:11 AM, Rafael Espíndola via llvm-dev
<llvm...@lists.llvm.org> wrote:
> > Sounds like threading isn't beneficial much beyond the second CPU...
> > Maybe blindly creating one thread per core isn't the best plan...
>
> parallel.h is pretty simplistic at the moment. Currently it creates
> one thread per SMT. One per core, and being lazy about it, would probably
> be a good thing, but threading is already beneficial and improving
> parallel.h is a welcome improvement.

Instead of using std::thread::hardware_concurrency (which is one per SMT), you
may be interested in using the facility I added for setting default ThinLTO
backend parallelism so that one thread per physical core is created,
llvm::heavyweight_hardware_concurrency() (see D25585 and r284390). The name is
meant to indicate that this is the concurrency that should be used for heavier
weight tasks (that may use a lot of memory, e.g.).
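A minimal sketch of the difference being discussed; the exact signature is
from memory (at the time of r284390 it returned a plain unsigned, like
std::thread::hardware_concurrency), and the fallback is just illustrative.

  #include <thread>
  #include "llvm/Support/Threading.h" // declares heavyweight_hardware_concurrency()

  unsigned pickThreadCount() {
    // One thread per hardware thread (SMT/hyper-thread): what parallel.h
    // effectively does today via std::thread::hardware_concurrency().
    unsigned PerSMT = std::thread::hardware_concurrency();
    // One thread per physical core: what the ThinLTO backends default to.
    unsigned PerCore = llvm::heavyweight_hardware_concurrency();
    // Fall back to the SMT count if physical-core detection returns 0.
    return PerCore ? PerCore : PerSMT;
  }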
On Nov 17, 2016, at 9:41 AM, Rui Ueyama via llvm-dev
<llvm...@lists.llvm.org> wrote:

On Thu, Nov 17, 2016 at 6:12 AM, Teresa Johnson via llvm-dev
<llvm...@lists.llvm.org> wrote:
> Instead of using std::thread::hardware_concurrency (which is one per SMT),
> you may be interested in using the facility I added for setting default
> ThinLTO backend parallelism so that one thread per physical core is created,
> llvm::heavyweight_hardware_concurrency() (see D25585 and r284390). The name
> is meant to indicate that this is the concurrency that should be used for
> heavier weight tasks (that may use a lot of memory, e.g.).

Sorry for my ignorance, but what's the point of running the same number of
threads as the number of physical cores instead of HT virtual cores? If we can
get better throughput by not running more than one thread per physical core,
it feels like HT is a useless technology.
On 17 November 2016 at 17:50, Mehdi Amini via llvm-dev
<llvm...@lists.llvm.org> wrote:
> It depends on the use-case: with ThinLTO we scale linearly with the number
> of physical cores. When you get over the number of physical cores you still
> get some improvements, but that’s no longer linear.
Indeed, in HT, cores have two execution units on the same cache/bus
line, so memory access is likely to be contended. Linkers are memory
hungry, which adds to the I/O bottleneck and makes most of the gain
disappear. :)
Furthermore, the FP unit is also shared among the ALUs, so
FP-intensive code does not make good use of HT. Not the case here,
though.
It is quite common for SMT to *not* be profitable. I did notice some
small wins by not using it. On an Intel machine you can do a quick
check by running with half the threads, since they always have 2x SMT.
I had the same experience. Ideally I would like to have a way to
override the number of threads used by the linker.
gold has a plethora of options for doing that, i.e.:

  --thread-count COUNT          Number of threads to use
  --thread-count-initial COUNT  Number of threads to use in initial pass
  --thread-count-middle COUNT   Number of threads to use in middle pass
  --thread-count-final COUNT    Number of threads to use in final pass
I don't think we need the full generality/flexibility of
initial/middle/final, but --thread-count could be useful (at least for
experimenting). The current interface of `parallel_for_each` doesn't
allow specifying the number of threads to run, so, assuming lld
goes that route (it may not), it should be extended accordingly.
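Purely as an illustration of the kind of extension that would be needed;
nothing below exists in parallel.h today, and the chunking scheme is a
strawman.

  #include <algorithm>
  #include <cstddef>
  #include <iterator>
  #include <thread>
  #include <vector>

  // Hypothetical parallel_for_each variant that takes an explicit thread
  // count, which is what a --thread-count option would have to plumb through.
  template <class It, class Fn>
  void parallel_for_each_n(It Begin, It End, unsigned NumThreads, Fn F) {
    std::size_t Len = std::distance(Begin, End);
    if (NumThreads <= 1 || Len <= 1) {
      std::for_each(Begin, End, F);
      return;
    }
    std::size_t Chunk = (Len + NumThreads - 1) / NumThreads;
    std::vector<std::thread> Workers;
    for (std::size_t I = 0; I < Len; I += Chunk) {
      It ChunkBegin = std::next(Begin, I);
      It ChunkEnd = std::next(Begin, std::min(I + Chunk, Len));
      // Each worker runs the callback over its own contiguous slice.
      Workers.emplace_back([=] { std::for_each(ChunkBegin, ChunkEnd, F); });
    }
    for (std::thread &T : Workers)
      T.join();
  }

With something along those lines, a --thread-count N flag would just be parsed
in the driver and passed down to the call sites.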
--
Davide
"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare
I share your view that lld should work fine out of the box. An
alternative might be to keep the option but make it hidden. The set of
users tinkering with linker options isn't large, although there are some
people who like to override/"tune" the linker anyway, so IMHO we
should expose a sane default and let users decide whether they care
(a similar example is what we do for --thinlto-threads or
--lto-partitions, even if in the last case we still have that set to 1
because it's not entirely clear what a reasonable number is).
I've seen a case where the linker was pinned to a specific subset of the CPUs
and many linker invocations were launched in parallel
(actually, that is the only time I've seen --threads for gold used).
I personally don't expect this to be the common use case, but it's not hard
to imagine complex build systems adopting a similar strategy.
LLD supports multi-threading, and it seems to be working well, as you can see
in a recent result. In short, LLD runs 30% faster with the --threads option and
more than 50% faster if you are using --build-id (your mileage may vary
depending on your computer). However, I don't think most users even know about
that, because --threads is not a default option.

I'm thinking to enable --threads by default. We now have real users, and
they'll be happy about the performance boost. Any concerns?

I can't think of problems with that, but I want to write a few notes about it:

- We still need to focus on single-thread performance rather than
  multi-threaded performance, because it is hard to make a slow program faster
  just by using more threads.
- We shouldn't do "too clever" things with threads. Currently, we are using
  multiple threads only at two places where they are highly parallelizable by
  nature (namely, copying and applying relocations for each input section, and
  computing the build-id hash; see the sketch below). We are using
  parallel_for_each, and that is very simple and easy to understand. I believe
  this was the right design choice, and I don't think we want to have
  something like the workqueues/tasks in GNU gold, for example.
- Run benchmarks with --no-threads if you are not focusing on multi-thread performance.
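The sketch referred to above: one way a build-id style hash parallelizes well
is to hash the output in independent chunks and then hash the concatenated
chunk digests. The chunk count and the use of std::hash below are placeholders
for illustration, not what lld actually ships.

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <functional>
  #include <string>
  #include <thread>
  #include <vector>

  // Hash Size bytes of output by splitting the buffer into one chunk per
  // hardware thread, hashing the chunks in parallel, and then hashing the
  // concatenation of the per-chunk digests.
  uint64_t parallelTreeHash(const uint8_t *Data, std::size_t Size) {
    std::size_t NumChunks = std::max(1u, std::thread::hardware_concurrency());
    std::size_t ChunkSize = (Size + NumChunks - 1) / NumChunks;
    std::vector<uint64_t> Digests(NumChunks, 0);
    std::vector<std::thread> Workers;
    for (std::size_t I = 0; I < NumChunks; ++I)
      Workers.emplace_back([&, I] {
        std::size_t Begin = std::min(I * ChunkSize, Size);
        std::size_t Len = std::min(ChunkSize, Size - Begin);
        Digests[I] = std::hash<std::string>()(
            std::string(reinterpret_cast<const char *>(Data + Begin), Len));
      });
    for (std::thread &T : Workers)
      T.join();
    // Combine the per-chunk digests with one final hash.
    std::string Combined(reinterpret_cast<const char *>(Digests.data()),
                         Digests.size() * sizeof(uint64_t));
    return std::hash<std::string>()(Combined);
  }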
Cheers,
Rafael
On 23 November 2016 at 02:41, Sean Silva via llvm-dev
<llvm...@lists.llvm.org> wrote:
Interesting. Might be worth giving another try to the idea of creating
the file in anonymous memory and using a write to output it.
On Wed, Nov 23, 2016 at 6:31 AM, Rafael Espíndola <rafael.e...@gmail.com> wrote:
> Interesting. Might be worth giving another try to the idea of creating
> the file in anonymous memory and using a write to output it.

I'm not sure that will help. Even the kernel can't escape some of these costs;
in modern 64-bit operating systems, when you do a syscall you don't actually
change the mappings (a TLB flush would be expensive), so the cost of populating
the page tables in order to read the pages is still there (and hence the
serialization point remains). One alternative is to use multiple processes
instead of multiple threads, which would remove the serialization point by
definition (it also seems like it might be a less invasive change, at least
for the copying+relocating step).

One experiment might be to add a hack to pre-fault all the files that are used,
so that you can isolate that cost from the rest of the link. That will give you
an upper bound on the speedup there is to get from optimizing this.
On Wed, Nov 23, 2016 at 4:53 PM, Sean Silva <chiso...@gmail.com> wrote:
> One experiment might be to add a hack to pre-fault all the files that are
> used, so that you can isolate that cost from the rest of the link. That will
> give you an upper bound on the speedup there is to get from optimizing this.

I experimented with adding MAP_POPULATE to LLVM's mmap in the hope that it
would do what you suggest, but it made LLD 10% slower, and I cannot explain
why.
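For reference, the experiment was along these lines; this is not the actual
patch, and error handling is trimmed. mapInputFile is an illustrative name.

  // Linux-specific sketch: map an input file with MAP_POPULATE so the kernel
  // pre-faults the pages up front, instead of the linker taking page faults
  // lazily while copying sections.
  #include <cstddef>
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  void *mapInputFile(const char *Path, size_t &Size) {
    int FD = open(Path, O_RDONLY);
    if (FD < 0)
      return nullptr;
    struct stat St;
    if (fstat(FD, &St) != 0) {
      close(FD);
      return nullptr;
    }
    Size = St.st_size;
    void *Buf =
        mmap(nullptr, Size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, FD, 0);
    close(FD); // the mapping stays valid after the descriptor is closed
    return Buf == MAP_FAILED ? nullptr : Buf;
  }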