Good point, Lei! The combination of scheduling and execution IR is possible, though I'm not sure we'd get much there that we can't get better from MLIR, so I've been avoiding the complexity of designing things such that it would be easily possible (like avoiding the function pointer/vtable dynamic lookup so that the linker could see things statically). The CSE, DCE, etc. we can perform at the higher levels obviate anything we'd get once we had one big soup of LLVM IR - we'd never put functions in the exported module if we didn't dispatch them, so LLVM's DCE doesn't have much to do across that boundary, etc. If everything were executing synchronously and single-threaded there might be some inlining opportunities (dispatches on the scheduling side get turned into function calls and possibly inlined), however the coolest features (like coalescing/dynamic batching) only come with async, and LLVM can't do much across async boundaries as it's inherently unsafe to do so at that low level.

So yeah, it's nonintuitive, but even though both the scheduling VM -> LLVM IR and the execution codegen -> LLVM IR parts are really interesting on their own, the fact that they get linked together is almost just a deployment detail (which I hope I correctly interpret as a good sign of orthogonal design ;). The benefit of taking the scheduling code to LLVM IR is that if you statically linked it with the runtime you could drop all the unused HAL methods at link time (as we have the vm.imports explicitly defined and already DCE'd down to just the set used), so you save on code size. Linking the scheduling and execution sides together would mostly just remove duplication between CRT functions (and my hope is that most of our generated code doesn't use the CRT at all - no printf's in the middle of kernels :) - which is still valuable, but thankfully doesn't require compromising the rest of the design to make possible.

On Fri, Mar 20, 2020 at 12:23 PM Stella Laurenzo <laur...@google.com> wrote:

FYI - some additional commentary forked onto the #core-development discord channel for this.

On Fri, Mar 20, 2020 at 12:17 PM Lei Zhang <antia...@google.com> wrote:

Maybe this is covered by the static library approach, but I guess there is also a mode in which one converts both the execution logic and the scheduling logic into LLVM IR and then throws them into one gigantic LLVM module to leverage the existing LLVM infrastructure for optimization. This blurs the boundary between scheduling and execution and remerges both worlds if one just wants a single executable to rule them all.

Thanks,
Lei

On Fri, Mar 20, 2020 at 2:02 PM Ben Vanik <benv...@google.com> wrote:

> It makes me happy when multiple people reach the same conclusion. Lesser chance of being totally wrong. :)

agreed! This was the intent with the design so I'm relieved you're all thinking along the same lines :P

Both of the AOT deployment cases are interesting - one provides an object file that can be linked in with the rest of the program (avoiding the need for dynamic loading at all), while the other enables dynamic module loading and hot swapping (rebuild just the module and reload without restarting the program). Users can choose what they want to do based on use cases or platform capabilities - for example, you can't download and run shared objects on iOS, but you can on Android - or if targeting WebAssembly you'd want to keep the modules separate so that you can download them on demand and reduce startup time.
The best part is that almost all of the engineering required there is the same and only the last mile (last kilometer?) differs.

The new hal.interface makes moving things across the boundary much cleaner. Now you (effectively) need just a single argument to your exported function to access all of the I/O. VMLA does this, for example, by passing in a vmla.interface. If you make that a struct with some function pointers for retrieving the bindings/constants/etc. then you have a way to inject the runtime functions from the HAL side via nice stable C API semantics (and you can version it if you want to load older shared object modules, etc.). The code in the HAL backend calling into those functions could be identical whether it's calling through a pointer to a function within the same binary or through one dlsym'ed from a shared object (or JIT'ed!). You could use the same struct to provide any imported functions you want (like library calls that are compiled into the host executable so that each compiled module doesn't need its own copy, reducing code size).

A nice property of the scheduler being in control of any multithreading is that the compiled modules don't need any of it in them. This reduces code size, prevents the classic TF-style situation of a thousand threads from tons of different threadpools that don't know about each other, and means that modules compiled at different versions can still coexist in the same binary as there's no complex dynamic linkage that could cause skew. The resulting compiled modules should be tiny - no C++ stdlib, no CRT (in most cases), etc.

As part of https://github.com/google/iree/issues/1036 the compiler backends will be able to insert their own passes after translation. This could be used, for example, to find all of the hal.executable fragments and combine them into a single LLVM module in a natural place. Part of https://github.com/google/iree/issues/1168 is allowing the backends to serialize their executables later in the process, so that you could take the newly globbed-together LLVM module and compile it (or serialize it to LLVM IR for JIT) as a single action.

This is nice. Similarly, we will be able to combine several spv.modules into one and serialize them together. There is a student interested in the SPIR-V module combiner open project who wants to contribute; this is where it would integrate.

So yeah, I think it's awesome that a tightly scoped interface allows almost identical LLVM IR executables to be compiled into the host executable, into shared objects for dynamic loading, or JIT'ed at runtime, with only a small bit of glue (module loading vs. JITing, etc.) as the difference. REPLs/colab can use the JIT, desktop apps can use AOT shared objects, and embedded/mobile apps can use AOT static libraries.

The best part is that since all that work is shared, a lot of the foundation (defining the interfaces, the injection, etc.) can be done with the current JIT, and then the other scenarios are just extensions we can add as needed as much more targeted work. Enabling the AOT shared object path should be mostly futzing with the LLVM driver configuration and using the DynamicLibrary helper we already have to load them, which doesn't feel like a scary amount of work :)

On Fri, Mar 20, 2020 at 10:00 AM Mahesh Ravishankar <ravish...@google.com> wrote:

It makes me happy when multiple people reach the same conclusion. Lesser chance of being totally wrong. :)

On Fri, Mar 20, 2020 at 9:51 AM Ahmed Taei <at...@google.com> wrote:

Thanks Mahesh for the write-up.
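To illustrate the interface struct Ben describes above, here is a minimal sketch in C; the struct name, fields, and versioning scheme are hypothetical stand-ins for illustration, not the actual hal.interface/vmla.interface definitions:

  /* Sketch only: an interface struct injected into each exported function.
   * All names here are hypothetical; the real shape would be defined by
   * hal.interface / vmla.interface. */
  #include <stddef.h>
  #include <stdint.h>

  typedef struct iree_interface_v0 {
    /* Version field so older shared object modules can still be loaded. */
    uint32_t version;
    /* Opaque runtime state threaded through each call. */
    void* state;
    /* Function pointers supplied by the HAL side; the compiled module calls
     * through these identically whether they resolve within the same binary,
     * were dlsym'ed from a shared object, or were JIT'ed. */
    void* (*get_binding)(void* state, int set, int binding);
    uint32_t (*get_constant)(void* state, int offset);
    /* Imported library calls compiled once into the host executable so that
     * each module doesn't carry its own copy. */
    void* (*memset_impl)(void* dst, int value, size_t size);
  } iree_interface_v0_t;

  /* An exported entry point then needs just this single argument for all I/O: */
  int my_dispatch(iree_interface_v0_t* interface);

The HAL side would fill in the function pointers before invoking the entry point, which is what lets the same compiled code work statically linked, dlopen'ed, or JIT'ed.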
This is very similar to what I have in mind.
- Going from llvmjit to llvmaot (let's call it that for now :) ) is a simple modification (additional option) to the llvm-ir backend: instead of generating an executable with serialized LLVM IR that is then executed by llvmjit at runtime, it generates a shared library where each executable has a single function with a fixed ABI (it's here; it just needs to move to the backend).
- Then the LLVM/MLIR-free HAL runtime will just need to load the shared library and invoke the "_invoke_" function per dispatch region (see the sketch after this list).
- What I was thinking beyond this (despite portability/security issues) is to serialize the shared object within the HAL module so we don't have to ship a module + shared_lib per model. But this isn't a big deal for now; let's handle deployment later.
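As a rough sketch of that loading path - assuming a hypothetical "_invoke_" signature taking packed buffer bindings, since the real fixed ABI is still to be defined by the backend:

  /* Sketch only: load an AOT-compiled executable and call its entry point.
   * The "_invoke_" name comes from the discussion above; its signature here
   * is an assumption for illustration. Link with -ldl. */
  #include <dlfcn.h>
  #include <stdio.h>

  typedef int (*invoke_fn_t)(void** bindings, int binding_count);

  int run_dispatch(const char* so_path, void** bindings, int binding_count) {
    void* library = dlopen(so_path, RTLD_NOW | RTLD_LOCAL);
    if (!library) {
      fprintf(stderr, "dlopen failed: %s\n", dlerror());
      return -1;
    }
    invoke_fn_t invoke = (invoke_fn_t)dlsym(library, "_invoke_");
    if (!invoke) {
      fprintf(stderr, "dlsym failed: %s\n", dlerror());
      dlclose(library);
      return -1;
    }
    int result = invoke(bindings, binding_count);
    dlclose(library);
    return result;
  }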
~Ahmed

On Fri, Mar 20, 2020 at 9:18 AM Mahesh Ravishankar <ravish...@google.com> wrote:

Hello All,
Ahmed and I had a brief chat about this and wanted to get a broader audience in for a bit more comment. Ahmed said he had talked to Stella and Oleg about having a CPU executable for compiled models - something that doesn't rely on using a JIT. I wanted to suggest one approach based on something I have done previously (and that works fairly well) and something that Ben mentioned in his talk.

The goal of AOT as I see it is that the entire model should be compiled into a stand-alone CPU executable that does not depend on LLVM or MLIR. One approach that matches well with IREE's current flow is as follows:

1) Create dispatch regions as they are created today. Compile each dispatch region into a function.

2) The "host" side of the IREE flow, i.e. the HAL scheduling, can be lowered into something very similar to the VMLA IR. This scheduling code is effectively the "main" of the executable. This "scheduling" IR can be converted into LLVM IR (or we can use the LLVM dialect in MLIR for this) to invoke those functions. Each call site knows which function it has to call, so there is no need for dynamic name resolution or linking of any sort. You end up with a single LLVM module that can be compiled for the host to get a stand-alone x86/ARM binary.

I am glossing over a lot of mechanics, e.g. how to marshal the inputs to the model, etc. The solution for those is that instead of generating a main, you generate a single function with a fixed ABI. So instead of generating an executable, you generate a shared library that exports a single function (with a C interface). You can then call this function either from a driver that marshals resources (a simple C++ function that can be model-agnostic) or invoke it from Python directly.

This is similar to what Ben mentioned w.r.t. VMLA being able to be converted to simple LLVM IR and compiled.

WDYT?

--
Mahesh
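To make the fixed-ABI shared library idea concrete, here is a hedged sketch; the entry point name "model_main", the argument layout, and the placeholder dispatch kernels are all illustrative assumptions, not anything IREE defines today:

  /* Sketch only: what the single exported entry point of the AOT-compiled
   * shared library could look like. */

  /* Dispatch regions compiled to functions; every call site is statically
   * known, so no dynamic name resolution or linking is needed. */
  static void dispatch_0(const float* in, float* tmp, int n) {
    for (int i = 0; i < n; ++i) tmp[i] = in[i] * 2.0f;  /* placeholder kernel */
  }
  static void dispatch_1(const float* tmp, float* out, int n) {
    for (int i = 0; i < n; ++i) out[i] = tmp[i] + 1.0f;  /* placeholder kernel */
  }

  /* The scheduling code is the "main": it sequences the dispatches. Exported
   * with a C interface so a driver or Python can call it without knowing
   * anything about the model internals. */
  __attribute__((visibility("default")))
  int model_main(const float* input, float* output, float* scratch, int n) {
    dispatch_0(input, scratch, n);
    dispatch_1(scratch, output, n);
    return 0;
  }

A small model-agnostic C++ driver could then load the library and call model_main with marshaled buffers, or Python could invoke it directly via ctypes - in either case the deployed binary carries no LLVM or MLIR dependency.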