Hi Teresa,
A year has passed since you posted this RFC; could you please give a
quick update on the current state of heap profiler development?
(Sorry if you already did so; I looked through the llvm-dev mailing
list to no avail -- but perhaps I missed something?)
We (Huawei) are very interested in data cache optimizations; we are
discussing our plans with Maxim and others on the BOLT project GitHub
(https://github.com/facebookincubator/BOLT/issues/178). I would be
really interested to hear your perspective / plans -- either in the
BOLT project discussion or here.
One area of particular interest is the specific data cache
optimizations you plan (or don't plan?) to implement -- in the
compiler, binary optimizer, or runtime optimizer -- based on heap
profiler data.
Thank you!
Yours,
Andrey
===
Advanced Software Technology Lab
Huawei
Hi Teresa,
Thank you for the quick reply! I'm really happy to see the project is
moving forward!
> We initially plan to use the profile information to provide guidance to the dynamic allocation runtime on data allocation and placement. We'll send more details on that when it is fleshed out too.
Just to double check: do you plan to open-source this runtime? --
perhaps as a part of LLVM?

Yours,
Andrey
On Wed, Jul 7, 2021 at 7:25 AM Xinliang David Li <xinli...@gmail.com> wrote:
>
>
> On Tue, Jul 6, 2021 at 5:09 AM Andrey Bokhanko <andreyb...@gmail.com> wrote:
>>
>> Hi Teresa,
>>
>> Thank you for the quick reply! I'm really happy to see the project is
>> moving forward!
>>
>> > We initially plan to use the profile information to provide guidance to the dynamic allocation runtime on data allocation and placement. We'll send more details on that when it is fleshed out too.
>>
>> Just to double check: do you plan to open-source this runtime? --
>
>
> It will be in tcmalloc initially.
Got it -- thanks!
We (Huawei) would be happy to contribute -- for example, with extra
testing on a different set of workloads / environments / hardware
targets.
Yours,
Andrey
Hi David,
On Wed, Jul 7, 2021 at 7:25 AM Xinliang David Li <xinli...@gmail.com> wrote:
>> > We initially plan to use the profile information to provide guidance to the dynamic allocation runtime on data allocation and placement. We'll send more details on that when it is fleshed out too.
>>
>> Just to double check: do you plan to open-source this runtime? --
>
>
> It will be in tcmalloc initially.
>
>>
>> perhaps as a part of LLVM?
>
>
> A wrapper runtime layer in LLVM is possible, but not initially.
I wonder how you plan to deliver guidance to tcmalloc on which
allocations should be made from the same memory chunk -- unless you
plan to read the data profile directly from the runtime (which I
doubt), this should be done via some kind of instrumentation inserted
by a compiler / binary optimizer (a la BOLT) -- right?
Yours,
Andrey
Responding to this one first, I'll respond to your other email shortly. Initially we plan to provide hints to tcmalloc via new APIs to help it make allocation decisions (things like hotness and lifetime). The compiler will be responsible for adding these hints, using some method to disambiguate the calling context (e.g. via function cloning, new parameter, etc).
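To illustrate the rough shape (a minimal sketch only -- all of these names are hypothetical and the real API is still being designed):

#include <cstddef>

// Hypothetical hint API -- purely illustrative, nothing here is a
// final design. The compiler would lower a profiled allocation site
// to a hinted call carrying aggregate properties from the profile.
enum class Hotness { Cold, Hot };
enum class Lifetime { Short, Long };

void *tcmalloc_hinted_new(std::size_t size, Hotness h, Lifetime l);

// Suppose the same allocation site is reached from contexts A and B,
// and the profile says A is hot/short-lived while B is cold/long-lived.
// Cloning the enclosing function is one way to disambiguate the
// context, so each clone carries a different static hint:
char *make_buf_for_ctx_A(std::size_t n) {
  return static_cast<char *>(
      tcmalloc_hinted_new(n, Hotness::Hot, Lifetime::Short));
}
char *make_buf_for_ctx_B(std::size_t n) {
  return static_cast<char *>(
      tcmalloc_hinted_new(n, Hotness::Cold, Lifetime::Long));
}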
Hi Teresa,

One more thing, if you don't mind.

On Tue, Jul 6, 2021 at 12:54 AM Teresa Johnson <tejo...@google.com> wrote:
> We initially plan to use the profile information to provide guidance to the dynamic allocation runtime on data allocation and placement. We'll send more details on that when it is fleshed out too.

I played with the current implementation, and became a bit concerned
whether the current data profile is sufficient for an efficient data
allocation optimization.

First, there is no information on temporal locality -- only the
total_lifetime of an allocation block is recorded, not its start / end
times -- let alone timestamps of the actual memory accesses. I wonder
what criteria a data profile-based allocation runtime would use to
decide to allocate two blocks from the same memory chunk?

Second, according to the data from [Savage'20], memory access affinity
(= the spatial distance between temporally close memory accesses to
two different allocated blocks) is crucial: figure 12 demonstrates
that this is vital for the omnetpp benchmark from SPEC CPU 2017.

That said, my concerns are based essentially on a single paper that
employs specific algorithms to guide memory allocation and measures
their impact on a specific set of benchmarks. I wonder if you have
preliminary data that validates the sufficiency of the implemented
data profile for efficient optimization of heap memory allocations?
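For concreteness, here is roughly the kind of per-allocation temporal
data I have in mind (all names are hypothetical, and I realize
recording this verbatim may be too expensive):

#include <cstdint>
#include <vector>

// Hypothetical richer temporal record for one allocation -- as I
// understand it, today only an aggregate total_lifetime survives.
struct TemporalAllocRecord {
  uint64_t alloc_timestamp; // when the block was allocated
  uint64_t free_timestamp;  // when it was freed
  // A coarse per-epoch access histogram would let the runtime see
  // that two different blocks are accessed close together in time
  // and co-locate them in the same chunk.
  std::vector<uint32_t> accesses_per_epoch;
};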
References:
[Savage'20] Savage, J., & Jones, T. M. (2020). HALO: Post-Link
Heap-Layout Optimisation. CGO 2020: Proceedings of the 18th ACM/IEEE
International Symposium on Code Generation and Optimization.
https://doi.org/10.1145/3368826.3377914

Yours,
Andrey
I was actually just typing up a reply welcoming contributions and suggesting you give the existing profile support a try -- I realized I need to add usage documentation to llvm/clang's docs, which I will do soon, but it sounds like you figured it out ok.
It would be difficult to add all of this information for every allocation, and particularly for every access, without being prohibitively expensive. Right now we have the avg/min/max lifetime, and just a single boolean per context indicating whether there was a lifetime overlap with the prior allocation for that context. We can probably expand this a bit to have somewhat richer aggregate information, but like I said, recording and emitting all start/end times and access timestamps would be an overwhelming amount of information. As I mentioned in my other response, initially the goal is to provide hints about hotness and lifetime length (short vs long) to the memory allocator so that it can make smarter decisions about how and where to allocate data.
Definitely interested in contributions or ideas on how we could collect richer information with the approach we're taking (allocations tracked by the runtime per context and fast shadow memory based updates for accesses).
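For reference, the aggregate info boils down to roughly this per
context (a simplified sketch -- not the exact fields or names in the
implementation):

#include <cstdint>

// Simplified sketch of the per-context aggregate record; the actual
// memprof record differs in fields and naming.
struct ContextProfileRecord {
  uint64_t alloc_count;         // allocations attributed to the context
  uint64_t total_access_count;  // via shadow-memory access counting
  uint64_t min_lifetime_ms;     // only aggregate lifetime stats --
  uint64_t avg_lifetime_ms;     // no per-allocation start/end times
  uint64_t max_lifetime_ms;
  bool overlapped_prior_alloc;  // single boolean: lifetime overlapped
                                // the prior allocation for this context
};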
Sorry if I sounded too demanding: the easiest thing in the world to do
is to sit on a sofa and throw out one demand after another... :-)
I understand that you have a big field full of question marks in front
of you and an immensely challenging task. My "tsk, tsk" message was
said tongue in cheek.
On Thu, Jul 8, 2021 at 7:04 PM Teresa Johnson <tejo...@google.com> wrote:
>
>
>
> On Thu, Jul 8, 2021 at 8:56 AM Andrey Bokhanko <andreyb...@gmail.com> wrote:
>>
>> On Thu, Jul 8, 2021 at 6:25 PM Teresa Johnson <tejo...@google.com> wrote:
>>>
>>> Responding to this one first, I'll respond to your other email shortly. Initially we plan to provide hints to tcmalloc via new APIs to help it make allocation decisions (things like hotness and lifetime). The compiler will be responsible for adding these hints, using some method to disambiguate the calling context (e.g. via function cloning, new parameter, etc).
>>
>> Sounds good -- thanks!
>>
>> (though this would increase code size and thus instruction cache pressure -- tsk, tsk... :-))
>
>
> There are various methods for disambiguating the contexts (e.g. HALO inserts some instructions at a minimal number of callsites to do this), and I suspect a hybrid method will be best in practice. E.g. cloning for hot allocation contexts or callsites, and instructions or parameters or something else on colder allocation contexts and callsites, to balance the code size and dynamic instruction overheads.
>
>
> --
> Teresa Johnson | Software Engineer | tejo...@google.com |
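Makes sense! If I read the hybrid idea correctly, it would look
roughly like this (purely illustrative, reusing the hypothetical
tcmalloc_hinted_new sketch from upthread):

#include <cstddef>

enum class Hotness { Cold, Hot };
enum class Lifetime { Short, Long };
void *tcmalloc_hinted_new(std::size_t, Hotness, Lifetime); // hypothetical

// Hot context: the enclosing function is cloned, so the hint is a
// compile-time constant -- no dynamic overhead, some code-size growth.
void *alloc_from_hot_context(std::size_t n) {
  return tcmalloc_hinted_new(n, Hotness::Hot, Lifetime::Short);
}

// Cold contexts: one shared copy, with the hint threaded through as an
// extra parameter -- a little dynamic overhead, no code-size growth.
void *alloc_from_cold_context(std::size_t n, Hotness h, Lifetime l) {
  return tcmalloc_hinted_new(n, h, l);
}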
Our team is busy at the moment with BOLT improvements (ARM64, shared libs, Golang support, etc.); after that, hopefully we'll be able to join the data profile development efforts.
Yours,
Andrey
This is a big undertaking with good potential but also some uncertainty about how effective such optimizations are for larger workloads, so I really appreciate the pioneering effort in LLVM.
We (Facebook) are very interested in this too. I reached out to David and Teresa a while ago about this, and was going to wait for the RFC before having more detailed discussions. But now that we're discussing it, here's my two cents about the division of responsibility between the compiler and the allocator, and about the API.
I think it'd be beneficial to let the compiler do more of the heavy lifting instead of relying heavily on the allocator. If we rely on less magic inside an allocator, we will likely benefit more users, who may use different allocators. Otherwise there's a risk that the compiler part becomes too coupled with a specific allocator, which limits the overall effectiveness of PGHO outside of that allocator.
This also affects what we want to expose in the new API for hinting the allocator (e.g. providing a grouping or arena-like hint computed by the compiler vs. passing a number of factors through the API that would help compute that inside the allocator). With a general, stable API, hopefully we won't need to change the API when we want to take more runtime info (temporal etc., even just for experiments) into account, or when we improve and leverage more from compiler analysis (I agree that in the long run we should improve compiler analysis).
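To make that contrast concrete, here are two hypothetical shapes the
hint API could take (names and fields are illustrative only):

#include <cstddef>
#include <cstdint>

// Option A: the compiler computes the placement decision and passes an
// opaque grouping / arena-like hint; any allocator can honor it without
// understanding how it was derived.
void *malloc_grouped(std::size_t size, std::uint32_t group_id);

// Option B: the compiler passes the raw factors and each allocator
// makes its own placement decision from them.
struct AllocFactors {
  std::uint8_t hotness;            // e.g. bucketed access frequency
  std::uint8_t expected_lifetime;  // e.g. bucketed lifetime class
};
void *malloc_with_factors(std::size_t size, AllocFactors f);

Option A keeps allocators thin and portable; Option B centralizes policy inside each allocator, which is exactly the coupling I'd like to avoid.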
I've talked with the jemalloc folks on our side, and we're flexible on API changes; given that, it makes sense to avoid the abstraction overhead of wrappers.
Looking forward to the RFC and more discussions on this.
Thanks,
Wenlei