TL;DR: Let's virtualize compiler outputs in Clang. These patches would get us started:- https://reviews.llvm.org/D95501 Add llvm::vfs::OutputManager- https://reviews.llvm.org/D95502 Initial adoption of llvm::vfs::OutputManager in Clang.Questions for the reader- Should we virtualize compiler outputs in Clang? (Hint: yes.)
- Does this support belong in LLVM? (I think it does, so that non-Clang tools can easily reuse it.)
- Is `llvm::vfs::` a reasonable namespace? (If not, suggestions? I think `llvm::` itself is too broad.)
- Do you have a use case that this won't address well?- Should that be fixed in the initial patch, or could this be evolved in-tree to address that?- Any other major concerns / suggestions?
- If you think the above patches should be split up for initial review / commit, how?
(Other feedback welcome too!)Longer versionThere are a number of use cases for capturing compiler outputs, which I'm hoping this proposal is a step toward addressing.- Tooling wants to capture outputs directly, without going through the filesystem.- Sometimes, tertiary outputs can be ignored, or still need to be written to disk.- clang-scan-deps is using a form of stripped down "implicit" modules to determine which modules need to be built explicitly. It doesn't really want to be using the on-disk module cache—in-memory would be better.- clang's ModuleManager manually maintains an in-memory modules cache for implicit modules. This involves copying the PCM outputs into memory. It'd be better for these modules to be file-backed, instead of copies of the stream.The patch has a bunch of details written / tested (https://reviews.llvm.org/D95501). Here are the high-level structures in the design:- OutputManager—a shared manager for creating outputs without knowing about the storage.- OutputConfig—configuration set on the OutputManager that can be (partially) overridden for specific outputs.- Output—opaque object with a raw_pwrite_stream, output path, and `erase`/`close` functionality. Internally, it has a linked list of output destinations.- OutputBackend—abstract interface installed in an OutputManager to create the "guts" of an output. While an OutputManager only has one installed, these can be layered / forked / mirrored.- OutputDestination—abstract interface paired with an OutputBackend, whose lifetime is managed by an Output.- ContentBuffer—actual content to allow efficient use of data by multiple backends. For example, the installed backend is a mirror between an on-disk and in-memory backend, the in-memory backend will either get the content moved directly into an llvm::MemoryBuffer, or a just-written mmap'ed file.The patch includes a few backends:- NullOutputBackend, for ignoring all backends.
- OnDiskOutputBackend, for writing to disk (the default), initially based on the logic in `clang::CompilerInstance`.
- InMemoryOutputBackend, for writing to an `InMemoryFileSystem`.
- MirroringOutputBackend, for writing to multiple backends. OutputDestination's API is designed around supporting this.
- FilteringOutputBackend, for filtering which outputs get written to the underlying backend.
Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).
Other work in the areaSee also: https://reviews.llvm.org/D78058 (thanks to Marc Rasi for posting that patch, and to Sam McCall for some feedback on an earlier version of this proposal).
Thanks for reading!Duncan
_______________________________________________
cfe-dev mailing list
cfe...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
This roughly corresponds to OutputBackend + OutputDestination in the patch, except:- the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction
- OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why
As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.
- Any other major concerns / suggestions?Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.
Relatedly, working-directory/relative-path handling should be considered.
(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)
Why doesn't this inherit from llvm::vfs::FileSystem?Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else.But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.
If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.
But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.
On 2021 Jan 28, at 11:24, Duncan P. N. Exon Smith via llvm-dev <llvm...@lists.llvm.org> wrote:Thirdly, there's some boilerplate related to lifetimes of multiple destinations. Probably having an explicit `MirroringOutputDestination` would be better.
Probably it's better to have the OutputBackend return an Output directly (and get rid of the OutputManager).
- Is `Output::getPath()` an abstraction leak?
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Sam, any thoughts here?On 2021 Feb 2, at 19:36, Duncan P. N. Exon Smith via llvm-dev <llvm...@lists.llvm.org> wrote:Update: I've incorporated much of Sam's feedback into the main patch (https://reviews.llvm.org/D95501).- Simplify OutputConfig, restricting it to semantic information about a specific output. Sink all backend configuration to flags on the backends themselves.
- Remove OutputManager, instead exposing OutputBackend directly.
- Merge Output and OutputDestination into a single class called OutputFile, and rename the API for creating them to OutputBackend::createFile().- Add support for working directories via OutputDirectory, and add OutputBackend::getDirectory() and OutputBackend::createDirectory().- Add support for createUniqueFile() and createUniqueDirectory(), both heavily used in clang. Backends without read access can use StableUniqueEntityAdaptor to implement these.
The main thing not settled is the threading guarantees. Restating the straw man I proposed:- All OutputBackend APIs are safe to call concurrently. Since OutputDirectory *is* an OutputBackend, it can be used concurrently as well.- An OutputFile cannot be used concurrently, but two files from the same backend can be.
Interested in everyone's thoughts on that; if that sounds reasonable I can update the patch to make it so.Two other points:- Sam proposed dropping OutputConfig. I don't think we can, as the API client needs to communicate semantic information about specific outputs, not about the backends generally.- Sam suggested OutputDestination (now OutputFile) seemed bloated. I talked through some of the details in my previous response; most of the complexity is there to make MirroringOutputBackend work efficiently and avoid duplication in subclasses. As a counterpoint, the only API a concrete subclass needs to override is storeContentBuffer().
Sam, let please take another look and let me know if you have more high-level comments.
Hi Duncan,
Thanks for this work. At AMD we were working on similar lines for a use case that requires entire process of compilations and linking to be done in-memory. The approach we were following was introducing write capability in LLVM VFS and use it with compiler outputs. This patch allows us to have in-memory compilations. I have two observations from the point of view of the use case above (I am not sure if it makes sense to have it here)
Are you planning to extend similar support for LLD in future?
Thanks,
Dinesh
From: llvm-dev <llvm-dev...@lists.llvm.org>
On Behalf Of Sam McCall via llvm-dev
Sent: Monday, February 22, 2021 7:41 PM
To: Duncan P. N. Exon Smith <dexon...@apple.com>
Cc: LLVM Dev <llvm...@lists.llvm.org>; CFE Dev <cfe...@lists.llvm.org>
Subject: Re: [llvm-dev] [cfe-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs
[CAUTION: External Email]
Only other thing is the scope of the patch - it's useful that as an RFC it's comprehensive and shows things fitting together.
Are you planning to extend similar support for LLD in future?
- The capability of deleting directories is not available. Applications use temp directories for compilation and it will be good to delete them.
The generated unique names when going through StableUniqueEntityHelper follows a predictable series of names unlike current implementation in LLVM which generates random names. This could pose a challenge in a scenario where application is performing repeated compilations each using separate Output backends.
Hi Duncan,
Thank you for getting back.
The use case is primarily JIT compilation for GPU specific code in OpenCL/HIP. We have a library that facilitates the JIT compilation. For a single compilation process it starts by creating temporary directories, write the kernel and header files to these directories. Thereafter invoke clang do the compilation and finally read the output into memory for loading onto the GPU. The cleanup process deletes the temporary directories.
Eventually we would like to bring the whole compilation process to be done in memory. One of the approaches I took while experimenting with OutputBackend patch is to have a persistent vfs::InmemoryFilesystem which can be used across compilations. An In-Memory OutputBackend using the above InmemoryFilesystem is created on the fly for each compilation. I chose not to use a persistent OutputBackend out of convenience as now I have only one object to pass around across the stack while InMemoryFileSystem is required to supply the input.
This is where I found that deletion/naming of temporary directories can pose a challenge. I could work around by either choosing to persist both OutputBackend and InmemoryFileSystem objects or none.
>>> Good point. For the OnDiskOutputBackend, we could default to non-stable / random names; for other backends, I prefer the assumption that the compilation is being done in isolation... but a backend that wants to coordinate better is free to do so. Does that make sense to you?
I understand.
Dinesh
On 2021 Mar 15, at 14:18, Duncan P. N. Exon Smith via cfe-dev <cfe...@lists.llvm.org> wrote:Thanks for the replies Sam! I've just back after being out for a bit; I'll try to give a substantive reply later this week.
I may have missed it (apologies if so), but is there somewhere a breakdown of which performance-critical configurations this is aimed at?(e.g. many small outputs that don't benefit from streaming, mirrored to two output streams, but one is null)Performance is certainly going to cut against simplicity sometimes, but it's hard to evaluate without knowing what we're optimizing for.