[llvm-dev] [RFC] Embedding Bitcode in Object Files

Steven Wu via llvm-dev

Feb 3, 2016, 1:25:41 PM

to cfe...@lists.llvm.org, LLVM Developers Mailing List

Apple has an internal implementation for embedding bitcode in object files
that we would like to upstream. It involves a few changes to the clang frontend,
including new clang options, clang driver changes, and utilities to embed bitcode
inside an object file. We believe upstreaming this implementation will benefit
people who would like to develop software for Apple platforms using open source
LLVM. It also helps driver compatibility, and it aligns with some ongoing
efforts like Thin-LTO, which also has an object wrapper for bitcode.

Embedded Bitcode Design:
Embedded bitcode is designed to enable bitcode distribution without disturbing
the normal development flow. When a program is compiled with bitcode, clang will
embed the optimized bitcode in a special section in the object file, together
with the options that were used during the compilation. The object file will still
have the normal TEXT and DATA sections for normal linking. During linking, the
linker will check that all the input object files have embedded bitcode and collect
the bitcode into an archive which is embedded in the output. The archive also
contains all the information that is needed to rebuild the linked binary. All
compilation and linking stages can be replayed to generate the final binary.

There are mainly two parts we would like to upstream first:
1. Clang Driver:
Adding a -fembed-bitcode option. When this new option is used, it will split the
compilation into 2 stages. The first stage runs the frontend and all the
optimization passes, and the second stage embeds the bitcode from the first
stage and then runs the CodeGen passes. There is also a -fembed-bitcode-marker
option that doesn't split the compilation into 2 stages and only puts a 1-byte
marker into the object file. This is used to speed up debug builds,
because bitcode serialization and verification make -fembed-bitcode slower,
especially with -O0 -g. The linker can still check for the presence of the section
to provide feedback if any of the object files participating in the link is
missing bitcode in a full bitcode build.
2. Bitcode Embedding:
Several special sections are used to mark the presence of the bitcode
in the MachO file.
"__LLVM, __bitcode" is used to store the optimized bitcode in the object file.
It can be 1 byte in size, serving as a marker to provide diagnostics in debug builds.
"__LLVM, __cmdline" is used to store the clang command-line options. There are
a few options that are not reflected in the bitcode that we would like to replay in
the rebuild. For example, the '-O0' option makes us run FastISel during the rebuild.
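
To make the marker idea concrete, here is a minimal sketch of the presence check a linker-like tool could perform. It is only an illustration under stated assumptions: the `{(segment, section): bytes}` mapping, the `BitcodeState` enum, and both function names are hypothetical and are not part of clang, ld, or the implementation described above. Only the section name and the 1-byte marker size come from the proposal.

```python
from enum import Enum

# Segment and section names as given in the proposal.
BITCODE_SECTION = ("__LLVM", "__bitcode")


class BitcodeState(Enum):
    MISSING = "missing"  # object has no __LLVM,__bitcode section
    MARKER = "marker"    # 1-byte placeholder (-fembed-bitcode-marker)
    FULL = "full"        # real embedded bitcode (-fembed-bitcode)


def classify(sections):
    """Classify one object, given a hypothetical {(segment, section): bytes} map."""
    payload = sections.get(BITCODE_SECTION)
    if payload is None:
        return BitcodeState.MISSING
    if len(payload) <= 1:
        return BitcodeState.MARKER
    return BitcodeState.FULL


def objects_without_section(objects):
    """Names of inputs that lack the section entirely, in input order."""
    return [name for name, sections in objects.items()
            if classify(sections) is BitcodeState.MISSING]
```

With this shape, a marker-only object still passes the presence check, so a debug build can report inputs that were compiled without either option while skipping the cost of serializing bitcode.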

Thanks

Steven
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Peter Collingbourne via llvm-dev

Feb 3, 2016, 1:48:59 PM

to Steven Wu, LLVM Developers Mailing List, cfe...@lists.llvm.org

Hi Steven,

Can you please explain how this relates to the existing .llvmbc section
feature?

Peter


Steven Wu via llvm-dev

Feb 3, 2016, 2:01:34 PM

to Peter Collingbourne, LLVM Developers Mailing List, cfe...@lists.llvm.org

Hi Peter

It is not currently related, because we started the implementation before Thin-LTO
was proposed in the community, but our "__LLVM, __bitcode" section is pretty much
the same as the ".llvmbc" section. Note that ".llvmbc" doesn't really follow the section
naming convention for MachO objects. I am hoping to unify them while upstreaming
the implementation.

Thanks

Steven

Rafael Espíndola

Feb 4, 2016, 5:01:01 PM

to Steven Wu, LLVM Developers Mailing List, cfe-dev

On 3 February 2016 at 14:01, Steven Wu via llvm-dev <llvm...@lists.llvm.org> wrote:
> Hi Peter
>
> It is not currently related because we started the implementation before Thin-LTO
> gets proposed in the community but our "__LLVM, __bitcode" section is pretty much
> the same as ".llvmbc" section. Note ".llvmbc" doesn't really follow the section
> naming convention for MachO objects. I am hoping to unify them during the upstream
> of the implementation.

That would be my main request. Seems like a nice feature, but we
should have one implementation of it :-)

BTW, can you explain a bit why you need things like "-O0" recorded? In
case you want to go from bitcode back to object file one file at a
time (no LTO)? Is that orthogonal? That is, should the command line be
included in .bc files too? What is the command line option that is
included, the -cc1 or the driver one?

There was some discussion in the past about which options get run in
clang if given -flto. For example, it seems likely that a more
conservative inlining pass would be a good thing to not remove
opportunities for the link time inlining. What would happen with
"-flto -fembed-bitcode"? Would the bitcode be the same as with just
-flto and the object file less optimized?

Cheers,
Rafael

Sergei Larin via llvm-dev

Feb 4, 2016, 5:19:08 PM

to Rafael Espíndola, Steven Wu, llvm...@lists.llvm.org, cfe-dev

Steven,

I would like to echo Rafael's comments.

My general understanding is that given an object file with embedded IR I should be able to reproduce the same object.
Everything else should be "supporting" that objective... which might include relevant flags and transformations leading _to_ this IR and _from_ this IR to the given object code.

Does my understanding match your overall goal?

Thanks.

Sergei

---
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation

Steven Wu via llvm-dev

Feb 4, 2016, 5:59:17 PM

to Sergei Larin, llvm...@lists.llvm.org, cfe-dev

Hi Sergei and Rafael

Thanks for the comment!

In terms of the bitcode section, my plan is to make the "__LLVM, __bitcode" section the MachO version of the ".llvmbc" section. In the latest Darwin OS, the "__LLVM" segment will not be loaded by dyld when you try to execute a binary with embedded bitcode, which is a plus for this feature.

And for the command line, Sergei has the correct idea about the motivation behind this. We want to have enough information to recreate the same binary from the embedded bitcode (at least when compiled with the same compiler). Here is an example:
$ clang -fembed-bitcode -O0 test.c -c -###
"clang" "-cc1" (...lots of options...) "-o" "test.bc" "-x" "c" "test.c" <--- First stage
"clang" "-cc1" "-triple" "x86_64-apple-macosx10.11.0" "-emit-obj" "-fembed-bitcode" "-O0" "-disable-llvm-optzns" "-o" "test.o" "-x" "ir" "test.bc" <--- Second stage
If we record all the options from the second stage, we can recreate the same object file using the exact same command. So, yes, they are cc1 flags. I understand they are not stable, but the second stage can only have a handful of options that: 1. affect codegen; 2. are not embedded in the bitcode and so should be recorded. This list should shrink towards zero eventually (not sure about -O0 and other optimization options). If we have to rename them before removing them from the embedding option list, we can provide an upgrade for them.

This feature is orthogonal to LTO. In my current implementation, "-flto -fembed-bitcode" is the same as "-flto". The linker needs logic to handle an LLVM bitcode file (treated as LTO) and a MachO file with embedded bitcode (treated as a normal link) differently.
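
One way to picture the distinction between the two kinds of linker input is to look at the first four bytes of each file. The sketch below is an illustration only, not the linker's actual logic: the function name `link_mode` and its return strings are made up for this example. The magic numbers are the standard on-disk values for raw LLVM bitcode, the bitcode wrapper header, and thin Mach-O files. Fat (universal) binaries and static archives are deliberately not handled.

```python
# On-disk byte sequences at offset 0.
BITCODE_MAGICS = (
    b"BC\xc0\xde",        # raw LLVM bitcode
    b"\xde\xc0\x17\x0b",  # bitcode wrapper header, 0x0B17C0DE stored little-endian
)
MACHO_MAGICS = (
    b"\xfe\xed\xfa\xce", b"\xce\xfa\xed\xfe",  # 32-bit Mach-O, either byte order
    b"\xfe\xed\xfa\xcf", b"\xcf\xfa\xed\xfe",  # 64-bit Mach-O, either byte order
)


def link_mode(header):
    """Pick a handling mode from the first bytes of an input file."""
    magic = header[:4]
    if magic in BITCODE_MAGICS:
        return "lto"      # pure bitcode input: handled as LTO
    if magic in MACHO_MAGICS:
        return "normal"   # object file, with or without embedded bitcode
    return "unknown"
```

Under this scheme an object built with -fembed-bitcode is still a Mach-O file at offset 0, so it takes the normal link path and its bitcode section is simply carried along.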

Thanks

Steven

James Y Knight via llvm-dev

Feb 5, 2016, 5:15:06 PM

to Steven Wu, LLVM Developers Mailing List, Clang Dev

On Wed, Feb 3, 2016 at 1:25 PM, Steven Wu via llvm-dev <llvm...@lists.llvm.org> wrote:

"__LLVM, __cmdline" is used to store the clang command-line options. There are
few options that are not reflected in the bitcode that we would like to replay in
the rebuild. For example, '-O0' option makes us run FastISel during rebuild.

Without knowing more details of your implementation, I'd be concerned about how this might impact deterministic/reproducible builds.

Source paths are recorded in a number of places, but you can typically fix that by using -fdebug-prefix-map. But if the entire command-line including the -fdebug-prefix-map argument gets stored in the output too, then you still have a problem.

Steven Wu via llvm-dev

Feb 5, 2016, 6:07:06 PM

to James Y Knight, LLVM Developers Mailing List, Clang Dev

I don't think we need any path in the command line section. We only record the command-line options that will affect CodeGen. See the example in one of my previous replies:

$ clang -fembed-bitcode -O0 test.c -c -###
"clang" "-cc1" (...lots of options...) "-o" "test.bc" "-x" "c" "test.c" <--- First stage
"clang" "-cc1" "-triple" "x86_64-apple-macosx10.11.0" "-emit-obj" "-fembed-bitcode" "-O0" "-disable-llvm-optzns" "-o" "test.o" "-x" "ir" "test.bc" <--- Second stage

I can't think of any source path that can affect CodeGen. No paths other than the bitcode input path and the binary output path should exist in the second stage, and those are excluded from the command line section as well. -fdebug-prefix-map is consumed by the front-end, and the prefixed paths are part of the debug info in the metadata. You don't need to encode -fdebug-prefix-map in the bitcode section to reproduce the object file with the same debug info. Does that answer your concern?
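
As a rough illustration of excluding the input and output paths, the sketch below filters the second-stage command from the example. This is an assumption-laden sketch, not the actual clang logic: the thread does not spell out exactly which options end up in "__LLVM, __cmdline", and the function name `recordable_options` and the small `FLAGS_WITH_VALUE` table are hypothetical. A real tool would rely on clang's own option tables instead of a hand-written set.

```python
# Hypothetical table of options that consume the following argument.
FLAGS_WITH_VALUE = {"-triple", "-x"}


def recordable_options(cc1_args):
    """Strip the executable, "-cc1", the output path and input paths."""
    args = list(cc1_args)
    if args and not args[0].startswith("-"):
        args = args[1:]              # drop the executable name
    if args and args[0] == "-cc1":
        args = args[1:]              # drop the cc1 marker
    kept = []
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "-o":
            i += 2                   # skip the flag and the output path
        elif arg in FLAGS_WITH_VALUE and i + 1 < len(args):
            kept.extend(args[i:i + 2])  # keep the flag together with its value
            i += 2
        elif not arg.startswith("-"):
            i += 1                   # bare positional argument: an input path
        else:
            kept.append(arg)
            i += 1
    return kept
```

Applied to the second-stage command above, this keeps the triple, -emit-obj, -fembed-bitcode, -O0, -disable-llvm-optzns and "-x ir", and drops test.o and test.bc, so no file system path survives.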

Thanks

Steven

James Y Knight via llvm-dev

Feb 5, 2016, 6:13:21 PM

to Steven Wu, LLVM Developers Mailing List, Clang Dev

On Fri, Feb 5, 2016 at 6:06 PM, Steven Wu <stev...@apple.com> wrote:

I don't think we need any path in the command line section. We only record the command-line options that will affect CodeGen. See my example in one of the preview reply:
$ clang -fembed-bitcode -O0 test.c -c -###
"clang" "-cc1" (...lots of options...) "-o" "test.bc" "-x" "c" "test.c" <--- First stage
"clang" "-cc1" "-triple" "x86_64-apple-macosx10.11.0" "-emit-obj" "-fembed-bitcode" "-O0" "-disable-llvm-optzns" "-o" "test.o" "-x" "ir" "test.bc" <--- Second stage
I can't think of any source path that can affect CodeGen. There should not be any paths other than the bitcode input path and binary output path exists in the second stage and they are excluded from the command line section as well. -fdebug-prefix-map is consumed by the front-end and prefixed paths are a part of the debug info in the metadata. You don't need to encode -fdebug-prefix-map in the bitcode section to reproduce the object file with the same debug info. Did that answer your concern?

Great -- it wasn't clear from the first message whether you were just embedding the whole command line as is. If the plan is instead to embed only a few relevant options, I agree there should be no issue as far as paths go.

Smith, Kevin B via llvm-dev

Feb 6, 2016, 5:30:27 PM

to James Y Knight, Steven Wu, llvm...@lists.llvm.org, Clang Dev

I don't know whether this is an issue in the current implementation, but I wanted to bring up a potential privacy issue.

In embedding the information, care should be taken to avoid embedding any information that contains personally identifiable information. This can certainly occur if paths need to be embedded, as user names, or other private/confidential information may be present in the naming of directories and paths. Generally, I suspect that it would be desirable to have an opt-in strategy for designating in the compiler which pieces of information/options need to be saved, and for all options marked as needed, determine whether there is the possibility/likelihood that they may contain personally identifiable information.

Kevin B. Smith

Hal Finkel via llvm-dev

Feb 6, 2016, 5:37:25 PM

to Kevin B Smith, llvm...@lists.llvm.org, Clang Dev

From: "Kevin B via llvm-dev Smith" <llvm...@lists.llvm.org>
To: "James Y Knight" <jykn...@google.com>, "Steven Wu" <stev...@apple.com>
Cc: llvm...@lists.llvm.org, "Clang Dev" <cfe...@lists.llvm.org>
Sent: Saturday, February 6, 2016 4:30:20 PM
Subject: Re: [llvm-dev] [RFC] Embedding Bitcode in Object Files

I don't know whether this is an issue in the current implementation, but I wanted to bring up a potential privacy issue.

In embedding the information, care should be taken to avoid embedding any information that contains personally identifiable information. This can certainly occur if paths need to be embedded, as user names, or other private/confidential information may be present in the naming of directories and paths.

Is this any more of a problem than the information that gets included in the DWARF sections?

-Hal


--

Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Smith, Kevin B via llvm-dev

Feb 6, 2016, 5:58:00 PM

to Hal Finkel, llvm...@lists.llvm.org, Clang Dev

Hal,

No, it is not more of a problem than with DWARF info. DWARF info definitely contains personally identifiable information. However, people usually realize that is the case, and will turn off or strip debug info if they are worried about such issues, or make a specific plan to cleanse that information.

You really just want to attempt to eliminate such information to the greatest extent possible. The desirability of using embedded bitcode in libraries (which is a very natural use model that I'm pretty sure this is intended to support) will be improved by taking this aspect of the implementation into consideration.

Kevin Smith

Mehdi Amini via llvm-dev

Feb 6, 2016, 8:46:58 PM

to Smith, Kevin B, llvm...@lists.llvm.org, Clang Dev

Hi,

There is not only DWARF but any use of the macro __FILE__ (so any assertions for instance).

I wouldn't expect the bitcode to contain any more (or less) information than the binary.

The options for the optimizer/codegen shouldn't need any "sensitive" information.

--

Mehdi

Steven Wu via llvm-dev

Feb 6, 2016, 9:03:03 PM

to Smith, Kevin B, llvm...@lists.llvm.org, Clang Dev

Hi Kevin

That is a very good concern, and we have ways to address the issue in our bitcode implementation to achieve something similar to ‘strip’ (hiding unnecessary symbols and debug info). It wasn’t in the proposal because we would like to get the basics in before diving into something more detailed and controversial.

Here is a short description of how we deal with the issue. Our implementation requires linker support, which runs a ‘Linkage-Unit’ pass that consistently renames all the symbols and metadata that are not exported. This has to be done after resolving all the symbols. We would be happy to upstream our implementation if it is beneficial.
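
The idea of consistent renaming can be sketched as follows. This is a toy model and not the actual ‘Linkage-Unit’ pass: the function names and the `__hidden_N` naming scheme are invented for illustration, and real symbol resolution and metadata handling are far more involved. The only property it tries to show is that every non-exported name maps to one opaque name across the whole linkage unit, while exported names are left alone.

```python
def build_rename_map(symbols, exported):
    """Map each non-exported symbol to an opaque name (hypothetical scheme).

    Sorting makes the mapping deterministic for a given set of symbols.
    """
    mapping = {}
    for name in sorted(set(symbols)):
        if name in exported:
            continue  # exported symbols keep their names
        mapping[name] = f"__hidden_{len(mapping)}"
    return mapping


def apply_renames(references, mapping):
    """Rewrite a list of symbol references using one shared mapping."""
    return [mapping.get(name, name) for name in references]
```

Because the map is built once after all symbols are known, two modules that refer to the same internal symbol end up with the same replacement name, which is why the step has to follow symbol resolution.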

Thanks

Steven

Joerg Sonnenberger via llvm-dev

Feb 6, 2016, 9:36:51 PM

to cfe...@lists.llvm.org, llvm...@lists.llvm.org

On Sat, Feb 06, 2016 at 05:46:50PM -0800, Mehdi Amini via cfe-dev wrote:
> There is not only DWARF but any use of the macro __FILE__ (so any assertions for instance).
> I wouldn't expect the bitcode to contain any more (or less) information than the binary.
> The options for the optimizer/codegen shouldn't need any "sensitive" information.

__FILE__ is a frontend issue, I still have to add some equivalent to my
remap patches for that into clang...

Joerg

Smith, Kevin B via llvm-dev

Feb 6, 2016, 10:32:27 PM

to stev...@apple.com, llvm...@lists.llvm.org, Clang Dev

I don't know what is involved in upstreaming that, but yes, it seems very useful/necessary to me.

Sergei Larin via llvm-dev

Feb 8, 2016, 11:07:49 AM

to Steven Wu, Smith, Kevin B, llvm...@lists.llvm.org, Clang Dev

My 2c…

Benefits of the feature clearly outweigh any potential privacy concerns from my point of view… and yes, there are multiple ways to deal with privacy even if it is an issue.

Sergei

---

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation

Adve, Vikram Sadanand via llvm-dev

Feb 8, 2016, 11:32:12 AM

to llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org

Steven,

How much of the code you’re upstreaming is specific to MacOS? Is it close to being usable on Linux? Some of the comments about Mach-O make it sound MacOS-specific, but you also talked about unifying it with the .llvmbc implementation previously.

FYI, two more use cases, which we are interested in:

(1) Autotuning the generated code on a target machine, using the embedded bitcode as a starting point. We have a limited prototype that searches through combinations of Clang command-line options, similar to Milepost GCC. We assume for now that we have a linked bitcode, e.g., the output of LTO; to be usable in practice, the bitcode would need to be embedded with the binary.

(2) Dynamic optimization, using the LLVM bitcode for subsets of the program.

--Vikram

// Vikram S. Adve
// Professor, Department of Computer Science
// University of Illinois at Urbana-Champaign
// http://llvm.org


Steven Wu via llvm-dev

Feb 18, 2016, 12:32:15 PM

to Sergei Larin, llvm...@lists.llvm.org, Clang Dev

Thanks everyone for giving me feedback. I will send out patches for the feature very shortly. Upstreaming bitcode obfuscation to handle the privacy issue is next on my list, but it will be a separate discussion.

Steven
