[llvm-dev] RFC: A binary serialization format for MemProf


Snehasish Kumar via llvm-dev

Sep 29, 2021, 8:17:29 PM
to llvm-dev, Vedant Kumar, Wenlei He, andreyb...@gmail.com, David Li, Teresa Johnson
This RFC contains the following:

* Proposal to introduce a new raw binary serialization format for heap allocation profiles

* Proposal to extend the PGO indexed format to hold heap allocation profiles

We look forward to your feedback on the proposals.

Introduction

------------

The design of a sanitizer-based heap profiler (MemProf) was shared with llvm-dev in June 2020 [1]. Since then, Teresa (@tejohnson) has added a sanitizer-based heap profiler which can be enabled with -fmemory-profile. Today it writes out the data in a text format which can be inspected by users. We have used this to drive analyses of heap behaviour at Google. This RFC shares details on a binary serialization format for heap profiling data, which can then be reused by the compiler to guide optimizations similar to traditional PGO.


Similar to the existing instrumentation-based PGO, the binary heap profile data for PGHO has two forms: the raw format used by the profiler runtime, and the indexed profile format used by the compiler. The indexed profile will be generated offline by llvm-profdata from the raw profile data. This allows a single binary profile file to hold both the PGO and Memprof profiling data. Fig 1 below shows the binary format generation and use.



  ┌──────────────────┐ Raw Profile ┌───────────────┐     Indexed Profile (v8)
  │ Compiler Runtime ├─────────────► llvm-profdata ├───► with Memprof data ───► -fprofile-use
  └──────────────────┘             └───────────────┘


Fig 1: Memprof binary profile lifecycle


Raw Binary Format

-----------------

The raw binary format contains four sections:

1. Memprof raw profile header
2. Memory mapping layout
3. Memory Info Block (MIB) records
4. Call stack information


          +----------------------+
          |  Magic               |
          +----------------------+   H
          |  Version             |   E
          +----------------------+   A
          |  Total Size          |   D
          +----------------------+   E
+---------|  Map Offset          |   R
|         +----------------------+
|   +-----|  MIB Offset          |
|   |     +----------------------+
|   |     |  Call Stack Offset   |---------+
|   |     +----------------------+         |
+-------->|  Number of           |   M     |
    |     |  Map Entries         |   A     |
    |     +----------------------+   P     |
    |     |                      |         |
    |     |  Map Entry           |   S     |
    |     +----------------------+   E     |
    |     |    ...               |   C     |
    |     +----------------------+   T     |
    |     |                      |   I     |
    |     |  Map Entry           |   O     |
    |     +----------------------+   N     |
    |                                      |
    |     +----------------------+         |
    +---->|   Number of          |   M     |
          |   MIB Entries        |   I     |
          +----------------------+   B     |
          |                      |         |
          |   MIB Entry          |   S     |
          +----------------------+   E     |
          |    ...               |   C     |
          +----------------------+   T     |
          |                      |   I     |
          |   MIB Entry          |   O     |
          +----------------------+   N     |
                                           |
          +----------------------+         |
          |   Number of          |<--------+
          |   Call Stack Entries |  S  S
          +----------------------+  T  E
          |   Call Stack Entry   |  A  C
          +----------------------+  C  T
          |   ...                |  K  I
          +----------------------+     O
          |   Call Stack Entry   |     N
          +----------------------+


Fig 2: Memprof Raw Format



* Memprof raw profile header

The header consists of a unique identifier, a version number, the total size of the profile, as well as offsets into the profile for each of the following sections: memory mapping layout, MIB records and call stack information. We do not intend to maintain backwards compatibility for the raw binary format.


* Memory mapping layout

The process memory mappings for the executable segment during profiling are stored in this section. This allows symbolization during post-processing for binaries built with position-independent code. For now, only read-only executable mappings are recorded; in the future, mappings for heap data could also be stored. For each mapping, we record:

  • Start (virtual address)
  • End (virtual address)
  • Offset (into the file it was mmap-ed from)
  • buildId (linker-generated hash, -Wl,--build-id)
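The fields above can be sketched as a fixed-size record. The RFC's measurements note that each map entry consumes 64 bytes; the split below (three 8-byte fields, a build id length, and a fixed 32-byte build id buffer) is one plausible packing and purely an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 64-byte packing of one memory mapping record; the exact
// layout (in particular the 32-byte build id buffer) is an assumption.
struct MemprofMapEntry {
  uint64_t Start;       // Virtual address where the mapping begins.
  uint64_t End;         // Virtual address where the mapping ends.
  uint64_t Offset;      // Offset into the file the segment was mmap-ed from.
  uint64_t BuildIdSize; // Number of valid bytes in BuildId.
  uint8_t BuildId[32];  // Linker-generated hash (-Wl,--build-id).
};

static_assert(sizeof(MemprofMapEntry) == 64, "expected a 64-byte map entry");
```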


* MIB Records

The profile information we collect is currently defined in [2]. The section begins with an 8-byte entry containing the number of profile entries in this section. Each entry is uniquely identified by the dynamic calling context of the allocation site. For each entry, various metrics such as access counts, allocation sizes and object lifetimes are computed via profiling. This section may contain multiple entries identified by the same call stack id. Subsequent processing to convert and merge multiple raw profiles will deduplicate any such entries.


* Call stack information

Each memprof profile entry is uniquely identified by its dynamic calling context. In this section we record the identifier and its corresponding call stack trace. We use the identifier provided by the sanitizer stack depot and serialize the trace for each without deduplication. The section begins with an 8-byte entry containing the number of call stack entries. Each call stack entry contains an 8-byte field which denotes how many frames are recorded in the entry. Each frame is identified by an 8-byte program counter which holds the call instruction virtual address - 1. Further deduplication is possible, though we do not do so at this time.
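A sketch of how a post-processing tool might walk this section, assuming the record layout implied by the text (an 8-byte entry count, then per entry a stack depot id, an 8-byte frame count, and one 8-byte PC per frame). The exact on-disk ordering is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

// Walk the call stack section of a raw profile buffer, returning a map
// from stack depot id to the recorded PC frames. Layout is assumed as
// described in the RFC text.
std::map<uint64_t, std::vector<uint64_t>> readCallStacks(const uint8_t *Buf) {
  auto Read64 = [&Buf]() {
    uint64_t V;
    std::memcpy(&V, Buf, sizeof(V));
    Buf += sizeof(V);
    return V;
  };
  std::map<uint64_t, std::vector<uint64_t>> Stacks;
  uint64_t NumEntries = Read64(); // 8-byte count of call stack entries.
  for (uint64_t I = 0; I < NumEntries; ++I) {
    uint64_t StackId = Read64();   // Sanitizer stack depot identifier.
    uint64_t NumFrames = Read64(); // 8-byte number of frames in this entry.
    std::vector<uint64_t> Frames(NumFrames);
    for (uint64_t F = 0; F < NumFrames; ++F)
      Frames[F] = Read64(); // Call instruction virtual address - 1.
    Stacks.emplace(StackId, std::move(Frames));
  }
  return Stacks;
}
```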


* Raw Profile Characteristics

To understand the characteristics of raw binary profiles generated by Memprof, we experimented with a clang bootstrap build. We ran 500 invocations of Memprof-instrumented clang on the clang source code. Each invocation produced a raw binary profile, and we present some aggregate information about them below:


+---------------------------------+--------+---------+---------+
|                                 | Min    | Median  | Max     |
+---------------------------------+--------+---------+---------+
| Unique allocation contexts      | 940    | 10661   | 35355   |
+---------------------------------+--------+---------+---------+
| MIB records size (bytes)        | 101528 | 1151396 | 3818348 |
+---------------------------------+--------+---------+---------+
| Call stack section size (bytes) | 31080  | 419048  | 1439144 |
+---------------------------------+--------+---------+---------+
| Total Size (bytes)              | 133336 | 1571680 | 5258220 |
+---------------------------------+--------+---------+---------+


The size of the header is 48 bytes, and the number of read-only executable memory maps is usually small (12 in this case), with each map entry consuming 64 bytes. We find that the raw profiles are highly compressible; on average, files are compressed by 81%. The largest raw profile in the dataset is ~5M in size; it is compressed to 975K using zip on default settings. In contrast, for the same clang build, the instrumented PGO raw profile is ~21M in size (zip compresses it by 73%). Note that the Memprof profile size is proportional to the number of allocation contexts observed during profiling.


Since these are profiles from individual invocations, they must be merged before use. This is performed implicitly during the conversion to indexed profile format by llvm-profdata. MIBs are merged based on their call stack.
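The merge step above can be sketched as folding MIB records keyed by their call stack id. Only an allocation count is shown, and the summing rule is illustrative; the real merge combines all profiled properties.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Illustrative sketch of the implicit merge performed during conversion:
// records from many raw profiles are keyed by call stack id and folded.
struct MIBRecord {
  uint64_t StackId;    // Unique dynamic calling context identifier.
  uint64_t AllocCount; // Other fields (sizes, lifetimes, ...) elided.
};

std::map<uint64_t, MIBRecord> mergeMIBs(const std::vector<MIBRecord> &Raw) {
  std::map<uint64_t, MIBRecord> Merged;
  for (const MIBRecord &R : Raw) {
    auto [It, Inserted] = Merged.try_emplace(R.StackId, R);
    if (!Inserted)
      It->second.AllocCount += R.AllocCount; // Fold duplicate contexts.
  }
  return Merged;
}
```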


Memprof extensions for the indexed profile format

-------------------------------------------------


+----------------+
|   MAGIC        |
+----------------+
|   VERSION      |
+----------------+
|   HASHTYPE     |
+--+----------+--+
|HASHTAB OFFSET  |-------+
+--+----------+--+       |
+----------------+       |
|                |       |
|  PROFILE       |       |
|  SUMMARY       |       |
|  DATA          |       |
+----------------+       |
+----------------+ <-----+
|                |
|   OnDisk       |
|   Chained      |
|   HashTable    |
|                |
+----------------+


Fig 3: Existing PGO indexed profile format


During the offline processing step (using llvm-profdata), allocation contexts are pruned and merged. The end result is a collection of unique allocation contexts with access, size and lifetime properties. Contexts are uniquely identified based on the call stack and are stored using a prefix deduplication scheme described in Section “Symbolized Memprof call stack section”.


To fit into the PGO profile format, we need to index the profile using the function name. The only functions that own Memprof profile data are the direct callers of the allocator primitive functions. Thus the profile data mapping in the IR must account for potentially missing frames. Implications for matching the profile data with the IR are touched upon in Section “Profile Data matching in IR” and will be further detailed in an upcoming RFC.


The Memprof profile data for a particular function can then be just an array of MIB entries. One allocation site in the function can have multiple MIB entries, each corresponding to one allocation context.


The change to the existing PGO indexed format is summarized as:

  • Augment the profile record data structure to include an optional MIB array after the value profile data [3].
  • Add one additional section before the existing OnDiskChainedHashTable to store the allocation stacks (referenced by the MIBs). This section is placed after the profile summary data section [4].
  • Bump the version number.



Memprof portable entry format

-----------------------------

The raw memprof profile entry format is subject to change across versions. However, the indexed profile entry must be backwards compatible to ensure that the PGO profile as a whole remains backwards compatible. We propose a schema-based format: a per-function description of a Memprof profile entry, where each field is identified by a tag. We propose the following schema:


struct MIBMeta {
       enum MIBFieldTag {
          // The tag ids remain unchanged.
          Unused = 0,
          StackID = 1,
          AllocCount = 2,
          AveSize = 3,
          MinSize = 4,
          MaxSize = 5,
          AveAccessCount = 6,
          MinAccessCount = 7,
          MaxAccessCount = 8,
          AveLifetime = 9,
          MinLifetime = 10,
          MaxLifetime = 11,
          NumMigration = 12,
          NumLifetimeOverlaps = 13,
          NumSameAllocCPU = 14,
          NumSameDeallocCPU = 15
       };

       enum MIBFieldType {
         UnusedType = 0,
         UINT8 = 1,
         UINT16 = 2,
         UINT32 = 3,      // Varint encoded
         UINT64 = 4       // Varint encoded
       };
};
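The UINT32/UINT64 field types above are marked as varint encoded. One common choice, assumed here for illustration, is LEB128-style encoding: 7 bits of payload per byte, with the high bit set while more bytes follow.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// LEB128-style varint encoding sketch; assumed, not necessarily the exact
// encoding the final implementation will use.
std::vector<uint8_t> encodeVarint(uint64_t V) {
  std::vector<uint8_t> Out;
  do {
    uint8_t Byte = V & 0x7F;
    V >>= 7;
    if (V)
      Byte |= 0x80; // High bit set: more bytes follow.
    Out.push_back(Byte);
  } while (V);
  return Out;
}

uint64_t decodeVarint(const std::vector<uint8_t> &In) {
  uint64_t V = 0;
  unsigned Shift = 0;
  for (uint8_t Byte : In) {
    V |= (uint64_t)(Byte & 0x7F) << Shift;
    Shift += 7;
    if (!(Byte & 0x80))
      break;
  }
  return V;
}
```

Small field values thus take a single byte instead of four or eight, which matters given that MIB sections hold tens of thousands of entries.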



// Mapping of tags to their descriptive names
const char *MIBFieldName[] = {
        "",
        "StackID",
        "AllocCount",
         …
        "NumSameDeallocCPU"
};


// Mapping of tags to their types
uint8_t MIBFieldType[] = {
         0, // unused
         MIBMeta::MIBFieldType::UINT64,
         …
};


To keep the field tag, field name, and field type declarations in sync, a shared .inc file will be used. This file will be shared between the compiler-rt and llvm/lib/ProfileData libraries, since direct dependencies from compiler-rt on llvm libraries are discouraged for isolation.


// MIBDef.inc file
// Define MIBEntryDef(tag, name, type) before inclusion.
MIBEntryDef(StackID = 1, "StackID",
            MIBMeta::MIBFieldType::UINT64)
MIBEntryDef(AllocCount = 2, "AllocCount",
            MIBMeta::MIBFieldType::UINT32)
...
MIBEntryDef(NumSameDeallocCPU = 15, "NumSameDeallocCPU",
            MIBMeta::MIBFieldType::UINT8)



enum MIBFieldTag {
   StartTag = 0,
   #define MIBEntryDef(tag, name, type) tag,
   #include "MIBDef.inc"
   #undef MIBEntryDef
};

const char *MIBFieldName[] = {
    "",
    #define MIBEntryDef(tag, name, type) name,
    #include "MIBDef.inc"
    #undef MIBEntryDef
};

uint8_t MIBFieldType[] = {
   0, // not used
   #define MIBEntryDef(tag, name, type) type,
   #include "MIBDef.inc"
   #undef MIBEntryDef
};


The InstrProfRecord for each function will hold the schema and an array of Memprof info blocks, one for each unique allocation context.
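As a rough sketch (an assumption, not LLVM's actual InstrProfRecord definition), the per-function record could carry the schema alongside one serialized info block per unique allocation context:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-function memprof payload: a schema listing which
// MIBFieldTag ids are present, plus one block per allocation context.
struct MIBBlock {
  std::vector<uint64_t> FieldValues; // One value per tag in Schema.
};

struct FunctionMemprofRecord {
  std::vector<uint8_t> Schema;  // Ordered list of MIBFieldTag ids present.
  std::vector<MIBBlock> Blocks; // One block per unique allocation context.
};
```

The schema is shared by all blocks in a record, so adding or removing a profiled field only changes the schema, keeping old readers compatible.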


Symbolized Memprof call stack section 

-------------------------------------

This section holds the symbolized version of the call stack section in the raw profile. This is necessary to enable the compiler to map recorded runtime PC addresses to source locations during recompilation. For space efficiency, this section is split into three subsections: (1) a stack entry table, (2) a file path table, and (3) a string table.

Fig 4 shows the relationship between the three tables.


     STACK ENTRY        FILE PATH         STRING
        TABLE             TABLE           TABLE
     ┌────────┐        ┌──┬──┬──┐       ┌─────────┐
     │        │        │  │  │  │       │10 abc   │
     │        │        ├──┼──┼──┤       ├─────────┤
     │        │        │  │  │  │◄┐     │11 def   │
     │        │        ├──┼──┼──┤ │     ├─────────┤
     ├──┬──┬──┤        │  │  │  │ │     │12 ghi   │
 ┌───┤03│LN│DI│        ├──┼──┼──┤ │     ├─────────┤
 │   ├──┴──┴──┤   ┌───►│03│13│01├─┘  ┌─►│13 XY.h  │
 │   └────────┘   │    └──┴─┬┴──┘    │  └─────────┘
 │                │         │        │
 └────────────────┘         └────────┘


Fig 4. The Stack Entry, File Path and String Table. 

       LN = Line Number, DI = Discriminator


* Stack Entry Table: Each uniquely identified symbolized call stack consists of a variable number of callsites. In the indexed format, each callsite is represented as file:line:discriminator (as shown in Fig 4). The call stack location is a 64-bit field split into three subfields: the file table index, the line number, and the discriminator value. The file table index is a pointer to a leaf node in the prefix encoding scheme described below.
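A sketch of packing and unpacking the 64-bit location field. The bit allocation used below (32 bits for the file index, 24 for the line, 8 for the discriminator) is an assumed split for illustration; the RFC does not specify the exact widths.

```cpp
#include <cassert>
#include <cstdint>

// Pack file:line:discriminator into one 64-bit stack entry location.
// The 32/24/8 split is an assumption, not the final format.
constexpr uint64_t packLocation(uint32_t FileIdx, uint32_t Line,
                                uint8_t Discriminator) {
  return (uint64_t)FileIdx << 32 | (uint64_t)(Line & 0xFFFFFF) << 8 |
         Discriminator;
}
constexpr uint32_t fileIndex(uint64_t Loc) { return Loc >> 32; }
constexpr uint32_t lineNumber(uint64_t Loc) { return (Loc >> 8) & 0xFFFFFF; }
constexpr uint8_t discriminator(uint64_t Loc) { return Loc & 0xFF; }
```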


* File path table using prefix encoding: Full path filenames share many common substrings due to the directory structure, so they compress well. One simple scheme is a reverse prefix tree representation, in which the name string of a directory at each level (not including prefixes) is represented by a node linked to its parent node. To summarize, the file path table is represented as an array of nodes organized as a forest of reversed trees.


For instance, the strings
“abc/def/ghi/XY.c”,
“abc/def/ghi/XY.h”,
“abc/def/jkl/UV.c”
are represented as


             +-----------+<------+
             |  abc      |       |
+----->+---->+-----------+       |
|      |     |  def      +-------+
|      |     +-----------+
|      +-----+  ghi      |<-------+----+
|            +-----------+        |    |
+-+--------->|  jkl      |        |    |
  |          +-----------+        |    |
  |          |  XY.c     +--------+    |
  |          +-----------+             |
  +----------+  UV.c     |             |
             +-----------+             |
             |  XY.h     +-------------+
             +-----------+



Fig 5. Prefix tree representation of paths in the file path table


Each parent link implies a path separator ‘/’. Furthermore, we represent each file or directory string as an integer offset into the string table (see Fig 4). Thus each node holds an offset into the string table and a pointer to the parent (interior) directory node.
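The construction above can be sketched as follows. One node is created per path component, holding a string table id and a parent index (-1 for a root); the node layout, interning scheme and sentinel value are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Sketch of building the reversed prefix tree for file paths.
struct PathNode {
  uint32_t StringId; // Offset/id of this component in the string table.
  int32_t Parent;    // Index of the parent directory node, -1 for roots.
};

struct FilePathTable {
  std::vector<std::string> StringTable;
  std::vector<PathNode> Nodes;
  std::map<std::string, uint32_t> StringIds;
  std::map<std::pair<int32_t, uint32_t>, int32_t> NodeIndex;

  uint32_t intern(const std::string &S) {
    auto [It, Inserted] = StringIds.try_emplace(S, StringTable.size());
    if (Inserted)
      StringTable.push_back(S);
    return It->second;
  }

  // Returns the leaf node index for Path, sharing directory nodes with
  // previously inserted paths. Each parent link implies a '/' separator.
  int32_t addPath(const std::string &Path) {
    int32_t Parent = -1;
    size_t Begin = 0;
    while (Begin <= Path.size()) {
      size_t End = Path.find('/', Begin);
      if (End == std::string::npos)
        End = Path.size();
      uint32_t Id = intern(Path.substr(Begin, End - Begin));
      auto [It, Inserted] = NodeIndex.try_emplace({Parent, Id}, 0);
      if (Inserted) {
        Nodes.push_back({Id, Parent});
        It->second = (int32_t)Nodes.size() - 1;
      }
      Parent = It->second;
      if (End == Path.size())
        break;
      Begin = End + 1;
    }
    return Parent;
  }
};
```

Inserting the three example paths from Fig 5 produces seven nodes and seven strings, since the abc/def prefix and the ghi directory are shared.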


* String table: To remove redundancies across prefix tree nodes in the file path encoding, we use a string table which maps each string to a unique id. The id can simply be the string's implicit offset into the table.


This representation thus exploits the redundancy in source file path prefixes, since a large number of source files typically live in a small number of deeply nested directories.


Symbolizing the raw PC addresses during post-processing with llvm-profdata requires the binary to be provided as input as well. As an alternative, we will also experiment with incrementally generating the symbolized call stack section as part of the raw profile dump, at the cost of increased profiling overhead.



Profile Data matching in IR

---------------------------


When matching the profile data, we may have already early-inlined the direct caller of the allocation into its caller(s). This means we need to take some additional steps to identify the matching MIBs. For example, consider the following partial call graph:



   ┌────────────────────┐           ┌────────────────────┐
 A │ Call B; debug loc1 │         C │ Call B; debug loc2 │
   └───────┬────────────┘           └───────────┬────────┘
           │                                    │
           │                                    │
           │    ┌─────────────────────────┐     │
           └────► Call malloc; debug loc3 ◄─────┘
              B └─────────────────────────┘


There will be 2 MIB entries, one for each context (A->B and C->B). As noted earlier, the MIB profile entries will be owned by the function calling the allocation function. Therefore, we will keep both MIB entries associated with function B in the profile.


If early inlining (i.e. before profile matching) inlines B into A but not into C, the code will look like the following when we try to match the profile:



                                          ┌────────────────────┐
    ┌────────────────────────┐          C │ Call B; debug loc2 │
  A │Call malloc; debug loc3 │            └───────────┬────────┘
    │           ; inlined at │                        │
    │           ; debug loc1 │         ┌──────────────┴───────────┐
    └────────────────────────┘       B │ Call malloc; debug loc3  │
                                       └──────────────────────────┘


Because the MIB corresponding to the A->B context is associated with function B in the profile, we do not find it by looking at function A’s profile when we see function A’s malloc call during matching. To address this we need to keep a correspondence from debug locations to the associated profile information. The details of the design will be shared in a separate RFC in the future.


[1] https://lists.llvm.org/pipermail/llvm-dev/2020-June/142744.html

[2] https://git.io/JzdRa

[3] https://git.io/JzdRR

[4] https://git.io/JzdRN


Xinliang David Li via llvm-dev

Sep 29, 2021, 9:13:26 PM
to Snehasish Kumar, llvm-dev, andreyb...@gmail.com
FYI you can also view the RFC here https://groups.google.com/g/llvm-dev/c/h1DvHguLpxU , which displays the diagrams better (without extra wide space).

David

Andrey Bokhanko via llvm-dev

Oct 1, 2021, 12:15:05 PM
to Snehasish Kumar, llvm-dev, David Li
Hi Snehasish, David and Theresa,

I'm really glad to see the steady progress in this area!

It looks like the format is pretty much language independent
(correct?) -- so it can be applied not only to C/C++, but other
languages (Rust) and even toolchains (Go) as well? If you have already
considered using data profile for non-C/C++, may I kindly ask you to
share your thoughts on this?

Yours,
Andrey
===
Advanced Software Technology Lab
Huawei

_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Snehasish Kumar via llvm-dev

Oct 1, 2021, 1:26:02 PM
to Andrey Bokhanko, llvm-dev, David Li
Hi Andrey,

The serialization format is language independent, though our focus is C/C++. Note that our instrumentation is based on the LLVM sanitizer infrastructure and should work for Rust (supports building with sanitizers [1]). We have not considered using the data profile for non-C/C++ codes.

Regards,
Snehasish

Teresa Johnson via llvm-dev

Oct 1, 2021, 1:32:38 PM
to Snehasish Kumar, llvm-dev, Andrey Bokhanko, David Li
I was going to respond similarly, and add a note that it isn't clear that gollvm (LLVM-based Go compiler) supports either PGO or the sanitizers, so that may be more difficult than Rust which does. As Snehasish notes, we are focused on C/C++, but this will all be done in the LLVM IR level and should be language independent in theory.
Teresa
--
Teresa Johnson | Software Engineer | tejo...@google.com |

Xinliang David Li via llvm-dev

Oct 1, 2021, 1:54:33 PM
to Teresa Johnson, Snehasish Kumar, llvm-dev, Andrey Bokhanko
On Fri, Oct 1, 2021 at 10:32 AM Teresa Johnson <tejo...@google.com> wrote:
>
> I was going to respond similarly, and add a note that it isn't clear that gollvm (LLVM-based Go compiler) supports either PGO or the sanitizers, so that may be more difficult than Rust which does. As Snehasish notes, we are focused on C/C++, but this will all be done in the LLVM IR level and should be language independent in theory.

+Than McIntosh to comment more on PGO and sanitizer support for gollvm.

David


> Teresa
>
> On Fri, Oct 1, 2021 at 10:25 AM Snehasish Kumar <sneha...@google.com> wrote:
>>
>> Hi Andrey,
>>
>> The serialization format is language independent, though our focus is C/C++. Note that our instrumentation is based on the LLVM sanitizer infrastructure and should work for Rust (supports building with sanitizers [1]). We have not considered using the data profile for non-C/C++ codes.
>>
>> Regards,
>> Snehasish
>>
>> [1] https://doc.rust-lang.org/beta/unstable-book/compiler-flags/sanitizer.html
>>
>> On Fri, Oct 1, 2021 at 9:14 AM Andrey Bokhanko <andreyb...@gmail.com> wrote:
>>>
>>> Hi Snehasish, David and Theresa,
>>>
>>> I'm really glad to see the steady progress in this area!
>>>
>>> It looks like the format is pretty much language independent
>>> (correct?) -- so it can be applied not only to C/C++, but other
>>> languages (Rust) and even toolchains (Go) as well? If you have already
>>> considered using data profile for non-C/C++, may I kindly ask you to
>>> share your thoughts on this?
>>>
>>> Yours,
>>> Andrey
>>> ===
>>> Advanced Software Technology Lab
>>> Huawei
>>>
>>> On Thu, Sep 30, 2021 at 1:17 AM Snehasish Kumar <sneha...@google.com> wrote:
>>> >
>
>
>
> --
> Teresa Johnson | Software Engineer | tejo...@google.com |

Than McIntosh via llvm-dev

Oct 1, 2021, 2:35:52 PM
to Xinliang David Li, Snehasish Kumar, llvm-dev, Andrey Bokhanko
Hi all,

Gollvm does not support the sanitizers at the moment. There is some support for PGO (driver plumbing and such) but it is in a fairly rudimentary state, mainly there to allow running experiments. It would need a good deal more work to make it production-quality.

Thanks, Than



Andrey Bokhanko via llvm-dev

Oct 4, 2021, 5:38:11 AM
to Teresa Johnson, Snehasish Kumar, llvm-dev, David Li
Thanks Teresa and others for the clarification!

On Fri, Oct 1, 2021 at 8:32 PM Teresa Johnson <tejo...@google.com> wrote:
I was going to respond similarly, and add a note that it isn't clear that gollvm (LLVM-based Go compiler) supports either PGO or the sanitizers, so that may be more difficult than Rust which does. As Snehasish notes, we are focused on C/C++, but this will all be done in the LLVM IR level and should be language independent in theory.

Let me note that I specifically meant gc (Google's standard Go compiler), not gollvm. IMHO, there is an intrinsic value of data formats being unified among different toolchains -- as very well demonstrated by DWARF.

(Yes, I'm aware that gc doesn't support even ages-long instruction profiling. One of the reasons is the apparent lack of implemented optimizations that can directly benefit from profiling. In case of memory profiling, the use case is clear. Also, given that BOLT helps Go a lot (up to +20% speed-up on our internal tests), I expect the same for memory profiling, which will warrant extending gc capabilities to use MemProf format.)

Yours,
Andrey

Teresa Johnson via llvm-dev

Oct 4, 2021, 10:55:25 AM
to Andrey Bokhanko, Than McIntosh, Snehasish Kumar, llvm-dev, David Li
+Than McIntosh again to comment on the gc question below.

On Mon, Oct 4, 2021 at 2:38 AM Andrey Bokhanko <andreyb...@gmail.com> wrote:
Thanks Teresa and others for the clarification!

On Fri, Oct 1, 2021 at 8:32 PM Teresa Johnson <tejo...@google.com> wrote:
I was going to respond similarly, and add a note that it isn't clear that gollvm (LLVM-based Go compiler) supports either PGO or the sanitizers, so that may be more difficult than Rust which does. As Snehasish notes, we are focused on C/C++, but this will all be done in the LLVM IR level and should be language independent in theory.

Let me note that I specifically meant gc (Google's standard Go compiler), not gollvm. IMHO, there is an intrinsic value of data formats being unified among different toolchains -- as very well demonstrated by DWARF.

(Yes, I'm aware that gc doesn't support even ages-long instruction profiling. One of the reasons is the apparent lack of implemented optimizations that can directly benefit from profiling. In case of memory profiling, the use case is clear. Also, given that BOLT helps Go a lot (up to +20% speed-up on our internal tests), I expect the same for memory profiling, which will warrant extending gc capabilities to use MemProf format.)

I don't think the gc compiler even involves llvm as it is written in Go. So that's definitely outside the scope of our work. I'm not personally very familiar with Go compiler toolchains and their roadmaps, but Than can probably comment.

Teresa

Than McIntosh via llvm-dev

Oct 4, 2021, 11:53:09 AM
to Teresa Johnson, Snehasish Kumar, llvm-dev, David Li, Andrey Bokhanko

>>I don't think the gc compiler even involves llvm as it is written in Go. 

Correct.

>>I'm not personally very familiar with Go compiler toolchains and their roadmaps, but Than can probably comment.

I don't see any reason why something similar to what Teresa and Snehasish are proposing couldn't be implemented for the Go gc-based toolchain (with a significant amount of effort)-- from my reading it looks fairly language independent.

True, as previously pointed out, the gc-based Go toolchain currently doesn't support ASAN and lacks any sort of PGO/FDO capability, but this is not written in stone.  FDO support, along with improving the compiler back end to exploit profile data (via inlining, basic block layout, etc) is something that could be added if need be. Go's priorities have simply been different from those of C/C++.


>IMHO, there is an intrinsic value of data formats being unified among different toolchains -- as very well demonstrated by DWARF

Comparison with DWARF seems a bit odd here. I agree that unified formats can be useful, but I would point out that there is a great deal of administrative overhead associated with standards like DWARF (committee meetings, heavyweight processes for reaching consensus on new features, release cycles measured in years, etc).  

Go (for example) uses its own object file format, as opposed to using an existing standard format (e.g. ELF or PE/COFF).  The ability to modify and evolve the object file format is a huge enabler when it comes to rolling out new features.  It was a key element in the last two big Go projects I've worked on; had we been stuck with an existing object file format, the work would have been much more difficult.

Than

Hongtao Yu via llvm-dev

Oct 4, 2021, 5:49:52 PM
to via llvm-dev, v...@apple.com, Wenlei He, andreyb...@gmail.com, dav...@google.com, tejo...@google.com, Lei Wang

Hello Snehasish,

 

It’s great to see this RFC about the profile format of the heap profiler since the last RFC. I have a couple of questions about how calling contexts are stored and processed by the compiler:

  1. How are recursive allocation contexts stored? Wondering if there’s any recursive compression performed. For example, a tree-based construction algorithm may create tree nodes recursively. Is each tree node object modeled by its unique dynamic context?
  2. Will the contexts of a leaf function be merged during compilation when the leaf function is not inlined? If so, where does the merging happen?

 

We have seen similar issues when working on CSSPGO (Context-sensitive PGO). I’m wondering if they could be a problem here.

 

Thanks,

Hongtao

Xinliang David Li via llvm-dev

Oct 4, 2021, 8:21:42 PM
to Hongtao Yu, via llvm-dev, Lei Wang, andreyb...@gmail.com
On Mon, Oct 4, 2021 at 2:41 PM Hongtao Yu <h...@fb.com> wrote:

Hello Snehasish,

 

It’s great to see this RFC about the profile format of the heap profiler since last the RFC. I have a couple questions about how calling contexts are stored and processed by the compiler:

  1. How are recursive allocation contexts stored? Wondering if there’s any recursive compression performed. For example, a tree-based construction algorithm may create tree nodes recursively.
This is handled by memprof/sanitizer runtime. Teresa can point to the answer.

 
  1. Is each tree node object modeled by its unique dynamic context?

I suppose the depth of the allocation contexts for tree nodes can be different.
  2. Will the contexts of a leaf function be merged during compilation when the leaf function is not inlined? If so, where does the merging happen?

The merging/pruning/trimming can happen during the offline processing step to reduce profile size. The merging should be based on whether contexts provide differing information or not. Merging contexts with different profile properties won't help optimizations.

thanks,

David 

Teresa Johnson via llvm-dev

Oct 4, 2021, 8:22:03 PM
to Xinliang David Li, via llvm-dev, Lei Wang, Hongtao Yu, andreyb...@gmail.com
On Mon, Oct 4, 2021 at 4:18 PM Xinliang David Li <dav...@google.com> wrote:


On Mon, Oct 4, 2021 at 2:41 PM Hongtao Yu <h...@fb.com> wrote:

Hello Snehasish,

 

It’s great to see this RFC about the profile format of the heap profiler since last the RFC. I have a couple questions about how calling contexts are stored and processed by the compiler:

  1. How are recursive allocation contexts stored? Wondering if there’s any recursive compression performed. For example, a tree-based construction algorithm may create tree nodes recursively.
This is handled by memprof/sanitizer runtime. Teresa can point to the answer.

Actually, they are not currently compressed by the runtime, and memprof follows the same mechanism as the sanitizer which simply truncates long stack contexts. This is something we can and should improve in the memprof runtime, however, and fold the recursive call chains.


 
  1. Is each tree node object modeled by its unique dynamic context?

I suppose the depth of the allocation contexts for tree nodes can be different.
  2. Will the contexts of a leaf function be merged during compilation when the leaf function is not inlined? If so, where does the merging happen?

Can you give a specific example of the case you are concerned about?
Teresa

Snehasish Kumar via llvm-dev

Oct 4, 2021, 8:37:52 PM
to h...@fb.com, llvm-dev, David Li, Andrey Bokhanko
Hi Hongtao,

> How are recursive allocation contexts stored? Wondering if there’s any recursive compression performed. For example, a tree-based construction algorithm may create tree nodes recursively. Is each tree node object modeled by its unique dynamic context?
There is no special handling of recursive calling contexts, we store the entire unique dynamic calling context as the identifier. 

> Will the contexts of a leaf function be merged during compilation when the leaf function is not inlined? If so, where does the merging happen?
During compilation, each allocation site may be annotated with one or more heap allocation info blocks each identified by a unique dynamic calling context. We will not merge heap profile information across unique contexts as one of our immediate goals is to distinguish between hot and cold allocation contexts. The mechanism to distinguish the allocation contexts involve cloning or parameterization and Teresa will present the details in an upcoming RFC.


Hongtao Yu via llvm-dev

unread,
Oct 4, 2021, 8:52:11 PM10/4/21
to Snehasish Kumar, llvm-dev, David Li, Andrey Bokhanko
Hi Snehasish, Teresa and David,

Thanks for the information. I have another question about the optimized (pass2) build. Does the runtime heap allocator identify a heap object using calling contexts too? Would some sort of virtual unwinding plus processing of debug inline contexts be needed?

Thanks,
Hongtao



Snehasish Kumar via llvm-dev

unread,
Oct 4, 2021, 9:43:43 PM10/4/21
to Hongtao Yu, llvm-dev, David Li, Andrey Bokhanko
Hi Hongtao,

Consider the following example with two contexts -

foo // This function is hot
  bar
    malloc()

baz // This function is cold
  bar
    malloc()

The profile loader will annotate the call to malloc() in the IR with
two contexts and their characteristics. Since one context is hot and
the other is cold, their characteristics differ (as David noted) and
we will not merge the contexts during profile processing. Now there
are a few ideas on how the allocator can determine whether this is a
hot or cold allocation at runtime --

1. Static deduplication via cloning - we can clone bar and rewrite the
call to malloc with a special call which indicates that it is cold.
The second example above would then look like --
baz
bar_cold
malloc_cold()
While this involves code duplication potentially increasing
icache/itlb footprint, for cold enough contexts we can tune the
threshold so that the benefit outweighs the cloning costs.
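As a sketch of what (1) could look like at the source level, under stated assumptions: `malloc_cold` and the clone name are hypothetical, and a real implementation would route the cold path to a separate memory pool (e.g. tcmalloc's hotness-based pooling) rather than forward to malloc.

```cpp
#include <cstdlib>

// Hypothetical cold-allocation entry point; a real allocator would place
// the object in a cold memory pool instead of forwarding to malloc.
void *malloc_cold(std::size_t N) { return std::malloc(N); }

// Hot contexts keep calling the original bar().
void *bar(std::size_t N) { return std::malloc(N); }

// Cold clone created for the baz -> bar context; its allocation call has
// been rewritten to the cold entry point.
void *bar_cold(std::size_t N) { return malloc_cold(N); }
```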

2. Parameterization - we can parameterize bar to carry additional
information that this current context is cold. Thus the code would
look like this --
baz
bar_parameterized (/*is_cold_context=*/ true)
if (is_cold_context) malloc_cold()
else malloc()
This will lead to code bloat on hot paths. This can also lead to a
large amount of parameterization when there are interleaving cold
contexts, increasing register pressure along hot paths. An optimized
approach may be able to pack the information using some encoding.

3. Runtime calling context identification - As you suggested, the
allocator can identify the heap object using the calling context. An
implementation might look like this --
baz
bar
malloc()
id = get_context()
if (is_context_cold(id)) malloc_cold
else ...
I believe the overheads of this approach are fairly high since the
context identification will happen at each dynamic call. E.g., Sumner et
al. measured the overhead to be ~2% overall for medium-size programs in
"Precise Calling Context Encoding". We anticipate runtime
identification of calling contexts on large workloads to be
prohibitively expensive.
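A minimal sketch of the id computation in (3), assuming the frames have already been captured (the capture itself, via frame-pointer walking or unwinding, is where most of the runtime cost lives). `hashContext` is a hypothetical helper, not part of any proposed runtime.

```cpp
#include <cstdint>

// FNV-1a hash over captured frame addresses, yielding a context id that
// could be looked up in a profile-derived hot/cold table. The expensive
// part in practice is capturing Frames at every allocation, not hashing.
uint64_t hashContext(const uint64_t *Frames, int NumFrames) {
  uint64_t H = 1469598103934665603ull; // FNV offset basis
  for (int I = 0; I < NumFrames; ++I) {
    H ^= Frames[I];
    H *= 1099511628211ull; // FNV prime
  }
  return H;
}
```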

Note that these are just a few ideas and we are currently leaning
towards (1). Happy to hear about any motivating data you may have for
these approaches, though an in-depth discussion of this should
probably be reserved for an RFC which Teresa will share soon.

Hongtao Yu via llvm-dev

unread,
Oct 4, 2021, 11:52:14 PM10/4/21
to Snehasish Kumar, llvm-dev, David Li, Andrey Bokhanko

Hi Snehasish,

 

Thanks for the analysis of different potential solutions to the context identification problem. Regarding #3, I’m wondering if a frame-chain-based virtual unwinding for heap allocations only could speed it up. But yeah, perhaps it would be more appropriate to move the discussion to Teresa's upcoming RFC.

 

Thanks,

Hongtao

Wenlei He via llvm-dev

unread,
Oct 7, 2021, 12:19:22 PM10/7/21
to Snehasish Kumar, llvm-dev, Vedant Kumar, andreyb...@gmail.com, David Li, Teresa Johnson, Hongtao Yu

Thanks for sharing the progress and details on the binary format. Overall this looks like a clean design that fits current PGO profile format with extensions.

 

Some high level comments:

 

  • Does memprof/PGHO work with today's IRPGO, i.e. can we have one instrumented build to collect both PGO and PGHO profiles, or will we need separate instrumentation builds for each? In the latter case, CSPGO + PGHO would need three iterations of training and build, which would be a significant operational cost.
  • I think some of the problems memprof faced when dealing with storing calling contexts and mapping contexts to IR are very similar to CSSPGO. I'm wondering if it makes sense to promote some existing infrastructure to be more general beyond just serving CSSPGO. One example is the IR mapping you mentioned (quoted below). In CSSPGO, we have the exact same need, and it's handled by `SampleContextTracker` which queries a context trie using an instruction/DILocation.

 

          >  Because the MIB corresponding to the A->B context is associated with function B in the profile, we do not find it by looking at function A’s profile when we see function A’s malloc call during matching. To address this we need to keep a correspondence from debug locations to the associated profile information.

 

  • The serialization of calling contexts and the pruning of calling contexts are also examples of shared problems, and we've put in some effort to have effective solutions (e.g. an offline preinliner for the most effective pruning, which I think could be adapted to help keep the most important allocation contexts). Perhaps some of the frameworks can be merged, so LLVM has general context-aware PGO support that can be leveraged by different kinds of PGO (IRPGO, PGHO, CSSPGO). If you think this is worth pursuing, we’d be happy to help too.

 

More on the details:

 

  • I saw that MemInfoBlock contains alloc/dealloc cpuid; does that make the memprof profile non-deterministic, in the sense that running memprof twice on the exact same program and input would yield bit-wise different memory profiles? I think the IR PGO profile is deterministic?

 

  • Why do we use `file:line:discriminator` instead of `func:line_offset:discriminator`? The latter would be more resilient to source changes. If the function name string is too long, we could perhaps leverage the MD5 encoding used by sample PGO?

 

  • Is the design of the mmap section (quoted below) trying to support memprof for multiple binaries in the same process at the same time, or mainly for handling multiple non-consecutive executable segments of a single binary?

 

           > The process memory mappings for the executable segment during profiling are stored in this section. This allows symbolization during post processing for binaries which are built with position independent code. For now all read only, executable  mappings are recorded, however in the future, mappings for heap data can also potentially be stored.

 

  • Do we need each function record to have its own schema, i.e. do we expect different functions to use different versions/schemas? This is very flexible, but I'm wondering what the use case is. If the schema is for compatibility across versions, perhaps a file-level schema would be enough?

 

            > The InstrProfRecord for each function will hold the schema and an array of Memprof info blocks, one for each unique allocation context.

 

 

Thanks,

Wenlei

 

From: Snehasish Kumar <sneha...@google.com>


Date: Wednesday, September 29, 2021 at 3:17 PM
To: llvm-dev <llvm...@lists.llvm.org>, Vedant Kumar <v...@apple.com>, Wenlei He <wen...@fb.com>, andreyb...@gmail.com <andreyb...@gmail.com>, David Li <dav...@google.com>, Teresa Johnson <tejo...@google.com>

Xinliang David Li via llvm-dev

unread,
Oct 7, 2021, 12:40:11 PM10/7/21
to Wenlei He, Snehasish Kumar, Hongtao Yu, llvm-dev, andreyb...@gmail.com
Just a quick note -- IRPGO profile is not deterministic with multi-threaded programs due to contentions (there is of course atomic update mode, but it can be slow). Asynchronous dumping is another reason that the profile is not guaranteed to be repeatable.

David

Snehasish Kumar via llvm-dev

unread,
Oct 7, 2021, 3:06:42 PM10/7/21
to Xinliang David Li, llvm-dev, Hongtao Yu, andreyb...@gmail.com
Hi Wenlei,

Thanks for taking a look! Added responses inline.

On Thu, Oct 7, 2021 at 9:29 AM Xinliang David Li <dav...@google.com> wrote:
>
> Just a quick note -- IRPGO profile is not deterministic with multi-threaded programs due to contentions (there is of course atomic update mode, but it can be slow). Asynchronous dumping is another reason that the profile is not guaranteed to be repeatable.
>
> David
>
> On Thu, Oct 7, 2021 at 9:18 AM Wenlei He <wen...@fb.com> wrote:
>>
>> Thanks for sharing the progress and details on the binary format. Overall this looks like a clean design that fits current PGO profile format with extensions.
>>
>>
>>
>> Some high level comments:
>>
>>
>>

Our focus is to have a single combined IR instrumentation and PGHO
instrumentation phase to keep operational costs low. For CSPGO today,
this would be the second IR instrumentation phase. We also intend to
support a separate PGHO instrumentation phase.


>> Does memprof/PGHO work together with today's IRPGO today, i.e. can we have one instrumented build to collect both PGO and PGHO profile, or we will need separate PGO instrumentation builds for each, in which case CSPGO + PGHO would need three iterations of training and build, which would be significant operational cost..

Yes, the context tracker is quite relevant to the IR matching need.
Teresa will share the detailed design soon and we can evaluate the
benefit of reusing the existing logic for CSSPGO. I think this is
orthogonal to this RFC (serialization format) so we can defer to the
next one for a detailed discussion.


>> I think some of the problems memprof faced when dealing with storing calling context and mapping context to IR is very similar to CSSPGO. I'm wondering if it makes sense to promote some existing infrastructure to be more general beyond just serving CSSPGO. One example is the IR mapping you mentioned (quoted below). In CSSPGO, we have the exact same need, and it's handled by `SampleContextTracker` which queries a context trie using an instruction/DILocation.
>>
>>
>>
>> > Because the MIB corresponding to the A->B context is associated with function B in the profile, we do not find it by looking at function A’s profile when we see function A’s malloc call during matching. To address this we need to keep a correspondence from debug locations to the associated profile information.
>>
>>
>>

We intend to retain as much of the calling context information until
the IR matching. This is where we can leverage common solutions. We
would be happy to generalize where appropriate and intend to tackle
this topic in detail in the next RFC.


>> The serialization of calling context, pruning of calling context are also example of shared problems, and we've put in some effort to have effective solutions (e.g. offline preinliner for most effective pruning, which I think could be adapted to help keep most important allocation context). Perhaps some of the frameworks can be merged, so LLVM has general context aware PGO support that can be leverage by different kinds of PGO (IRPGO, PGHO, CSSPGO). If you think this is worth pursuing, we’d be happy to help too.
>>
>>
>>
>> More on the details:
>>
>>
>>

As David mentioned, keeping the PGHO profile deterministic is a
non-goal since IR PGO profile is non-deterministic.


>> I saw that MemInfoBlock contains alloc/dealloc cpuid, does that make memprof profile non-deterministic in the sense that running memprof twice on the exact program and input would yield bit-wise different memory profile? I think IR PGO profile is deterministic?
>>
>>
>>

We need to use the file path instead of the function name to be able to
distinguish COMDAT functions. The line_offset based matching is more
resilient if the entire function is moved; I think it's a good idea
and we can incorporate it into the IR matching phase.


>> Why do we use `file:line:discriminator` instead of `func:line_offset:discriminator `? The later would be more resilient to source change. If function name string is too long, we could perhaps leverage the MD5 encoding used by sample PGO?
>>
>>
>>

While we only intend to support Memprof optimizations for the main
binary, retaining all executable mappings allows future analysis tools
to symbolize shared library code.


>> Is the design of mmap section (quoted below) trying to support memprof for multiple binaries in the same process at the same time, or mainly for handling multiple non-consecutive executable segments for a single binary?
>>
>>
>>
>> > The process memory mappings for the executable segment during profiling are stored in this section. This allows symbolization during post processing for binaries which are built with position independent code. For now all read only, executable mappings are recorded, however in the future, mappings for heap data can also potentially be stored.
>>
>>

Yes, we do intend to support Memprof profile section merging via
`llvm-profdata merge`. The schema overhead per function is low now, so
we opted for function granularity. We can revisit if the overheads are
high or if the IR metadata scheme intends to keep it at module
granularity (in which case we don't need the extra fidelity).


>> Do we need each function record to have its own schema, do we expect different functions to use different versions/schemas? The is very flexible, but wondering what’s the use case. If the schema is for compatibility across versions, perhaps a file level scheme would be enough?
>>
>>
>>
>> > The InstrProfRecord for each function will hold the schema and an array of Memprof info blocks, one for each unique allocation context.
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Wenlei
>>

Wenlei He via llvm-dev

unread,
Oct 7, 2021, 4:00:12 PM10/7/21
to Snehasish Kumar, Xinliang David Li, llvm-dev, andreyb...@gmail.com, Hongtao Yu

Thanks for the reply and clarification. Having a single combined IR instrumentation and PGHO instrumentation sounds good.

 

I’m also wondering if you have any data you could share that shows the overall benefit of memprof-driven optimization since the last RFC, perhaps with an early prototype on a small/synthetic workload? Asking because even though this all looks promising, from runtime support to binary format, to the profile loader and optimization, there’s non-trivial complexity being added in a few places.

 

Thanks,

Wenlei

Snehasish Kumar via llvm-dev

unread,
Oct 8, 2021, 12:42:36 PM10/8/21
to Wenlei He, llvm-dev, Hongtao Yu, Xinliang David Li, andreyb...@gmail.com
Hi Wenlei,

We are still working on an end to end prototype and do not have any
data to share at this time. Our work is motivated by manual tuning of
a large internal workload which leverages tcmalloc support for hotness
based memory pooling (to be open sourced soon). With Memprof,
preliminary analyses indicate we can automatically cover all manually
identified cases and identify more opportunities. In the future we aim
to support more hot-cold memory splitting optimizations such as the
schemes described in "Software-Defined Far Memory in Warehouse-Scale
Computers" ASPLOS 2019. We look forward to sharing more data and a
prototype once the Memprof IR annotation and optimization consumer RFC
has been vetted.

Regards,
Snehasish

Xinliang David Li via llvm-dev

unread,
Oct 8, 2021, 7:39:25 PM10/8/21
to Snehasish Kumar, llvm-dev, Hongtao Yu, andreyb...@gmail.com
On Fri, Oct 8, 2021 at 9:42 AM Snehasish Kumar <sneha...@google.com> wrote:
Hi Wenlei,

We are still working on an end to end prototype and do not have any
data to share at this time. Our work is motivated by manual tuning of
a large internal workload which leverages tcmalloc support for hotness
based memory pooling (to be open sourced soon). With Memprof,
preliminary analyses indicate we can automatically cover all manually
identified cases and identify more opportunities. In the future we aim
to support more hot-cold memory splitting optimizations such as the
schemes described in "Software-Defined Far Memory in Warehouse-Scale
Computers" ASPLOS 2019. We look forward to sharing more data and a
prototype once the Memprof IR annotation and optimization consumer RFC
has been vetted.

Yes, the initial goal is targeting coarse-grained smart placement for locality and memory savings. Longer term it will also handle fine-grained placement strategies, which will require more extensive interaction with the memory allocator layer.

David

Teresa Johnson via llvm-dev

unread,
Oct 10, 2021, 11:34:57 AM10/10/21
to Snehasish Kumar, llvm-dev, Hongtao Yu, Xinliang David Li, andreyb...@gmail.com
In fact, we need to retain the calling context beyond matching, so that we can perform the context disambiguation transformations that Snehasish described in an earlier email. The next RFC will focus on the IR metadata needed to carry the PGHO data as well as the context.

From reading through the CSSPGO RFC it sounds like the context info is never annotated onto the IR, but rather just used during the sample PGO loading/inlining step to help generate more accurate IR prof md counts - is that correct? In that case perhaps some of the infrastructure can be shared for performing the matching for already inlined contexts, which I think is what the ContextTrieNode structures are used for from what I can tell perusing the code. It is a little unclear to me - how is the profile for a partially inlined context found in the data structure - i.e. how do you look up the ContextTrieNode for a given out of line function?

Thanks,
Teresa



Wenlei He via llvm-dev

unread,
Oct 10, 2021, 12:34:28 PM10/10/21
to Teresa Johnson, Snehasish Kumar, llvm-dev, Xinliang David Li, andreyb...@gmail.com, Hongtao Yu

Correct, context info is never annotated onto the IR. It is used to help drive better context-sensitive inline decision and also preserves more accurate post-inline counts.

 

For partially inlined or out-of-line functions, we still look up the context trie through the context tracker to get the profile. The context tracker is one layer above the context trie which “promotes” context profiles for not-inlined callsites; this, combined with the top-down inline strategy of the sample loader inliner, makes sure we always have the “best matching” context for partially inlined or out-of-line functions. Example: for A->B->C, if we decided not to inline B, the B->C context would be promoted to be under “root” directly instead of under “A”. So the profile for out-of-line B can be found under “root”, which naturally represents the empty context. But if PGHO needs various contexts past inlining for context disambiguation later in the allocator runtime, where inlining isn’t legal or beneficial, then yes, we’d need to actually make profile metadata context-aware, which is not happening today even with CSSPGO (I think we briefly discussed that in one of the patches; it feels like we could have a generalized context-aware MD representation).
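The promotion Wenlei describes can be illustrated with a toy string-keyed map. The real ContextTrieNode is a trie keyed by frames; `promoteContext` and the string keys here are illustrative only.

```cpp
#include <map>
#include <string>

// Re-key the context profile of a not-inlined callsite under a shorter
// context, merging counts if the target already exists. E.g. when B is
// not inlined into A, "A->B->C" is promoted to "B->C" under the root.
void promoteContext(std::map<std::string, unsigned> &Profiles,
                    const std::string &From, const std::string &To) {
  auto It = Profiles.find(From);
  if (It == Profiles.end())
    return;
  Profiles[To] += It->second;
  Profiles.erase(It);
}
```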

 

Thanks,

Wenlei
