[llvm-dev] MCJIT Runtime Performance


Morten Brodersen via llvm-dev

Feb 4, 2016, 7:00:55 PM
to llvm...@lists.llvm.org
Hi All,

We recently upgraded a number of applications from LLVM 3.5.2 (old JIT)
to LLVM 3.7.1 (MCJIT).

We made the minimum changes needed for the switch (no changes to the IR
generated or the IR optimizations applied).

The resulting code passes all tests (8000+).

However, the runtime performance dropped significantly: 30% to 40% across
all applications.

The applications I am talking about optimize airline rosters and
pairings. LLVM is used for compiling high level business rules to
efficient machine code.

A typical optimization run takes 6 to 8 hours. So a 30% to 40% reduction
in speed has real impact (=> we can't upgrade from 3.5.2).

We have triple-checked and reviewed the changes we made going from the old
JIT to MCJIT. We also tried different ways to optimize the IR.

However all results indicate that the performance drop happens in the
(black box) IR to machine code stage.

So my question is whether this runtime performance reduction is known/expected
for MCJIT vs. the old JIT, or whether we might be doing something wrong?

If you need more information in order to understand the issue, please let
us know and we can provide more details.

Thanks
Morten

_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Keno Fischer via llvm-dev

Feb 4, 2016, 7:05:54 PM
to Morten Brodersen, llvm-dev
Yes, unfortunately, this is very much known. Over in the Julia project, we've recently gone through this and taken the hit (after doing some work to fix the very extreme corner cases that we were hitting). We're not entirely sure why the slowdown is this noticeable, but at least in our case profiling didn't reveal any remaining low-hanging fruit responsible for it. One thing you can potentially try, if you haven't yet, is to enable fast ISel and see if that brings you closer to the old runtimes.

Hal Finkel via llvm-dev

Feb 4, 2016, 7:16:23 PM
to Keno Fischer, llvm-dev, Morten Brodersen
----- Original Message -----
> From: "Keno Fischer via llvm-dev" <llvm...@lists.llvm.org>
> To: "Morten Brodersen" <Morten.B...@constrainttec.com>
> Cc: "llvm-dev" <llvm...@lists.llvm.org>
> Sent: Thursday, February 4, 2016 6:05:29 PM
> Subject: Re: [llvm-dev] MCJIT Runtime Performance
>
> [snip]
> One thing you can potentially try if you haven't yet is to enable fast
> ISel and see if that brings you closer to the old runtimes.

And maybe the register allocator? Are you using the greedy one or the linear one? Are there any other MI-level optimizations running?
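For anyone wanting to experiment with this, the register allocators can be compared offline with llc, outside the JIT entirely. The following is a sketch; it assumes the IR has been dumped to a file (the name rules.ll is made up) and uses the standard -regalloc flag:

```shell
# Compile the same dumped IR with different register allocators and
# compare the generated assembly. Greedy is the default at -O2/-O3;
# basic and fast are the simpler alternatives.
llc -O3 -regalloc=greedy rules.ll -o rules-greedy.s
llc -O3 -regalloc=basic  rules.ll -o rules-basic.s
diff rules-greedy.s rules-basic.s | head
```

If the two outputs differ substantially on hot functions, that narrows the problem down to allocation rather than instruction selection.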

-Hal

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Lang Hames via llvm-dev

Feb 4, 2016, 7:52:06 PM
to Hal Finkel, Keno Fischer, llvm-dev, Morten Brodersen
These are some pretty extreme slowdowns. The legacy JIT shared the code generator with MCJIT, and as far as I'm aware there were really only three main differences:

1) The legacy JIT used a custom instruction encoder, whereas MCJIT uses MC.
2) (Related to 1) MCJIT needs to perform runtime linking of the object files produced by MC.
3) MCJIT does not compile lazily (though it sounds like that's not an issue here?)

Keno - did you ever look at the codegen pipeline construction for the legacy JIT vs MCJIT? Are we choosing different passes?

Morten - Can you share any test cases that demonstrate the slowdown? I'd love to take a look at this.

Cheers,
Lang.

Keno Fischer via llvm-dev

Feb 4, 2016, 7:56:57 PM
to Lang Hames, llvm-dev, Morten Brodersen
We are using the same IR passes. We did not look at the backend passes other than fast ISel, because I didn't realize we had a choice there; do we? In our profiling, nothing in MCJIT specifically (relocations, etc.) is taking a significant amount of time. As far as we could tell, most of the slowdown was in ISel, with a couple of additional percent in various IR passes.

Lang Hames via llvm-dev

Feb 4, 2016, 8:03:45 PM
to Keno Fischer, llvm-dev, Morten Brodersen
Hi Keno,

> ... I didn't realize we had a choice there...

You do, though I don't think the dials and levers have been plumbed through to the interface. I'm happy to take a look at doing that; I'd very much like clients to have more options here.

Cheers,
Lang. 

Morten Brodersen via llvm-dev

Feb 4, 2016, 9:42:20 PM
to llvm-dev
Hi Keno,

Thanks for the fast ISel suggestion.

Here are the results (for a small but representative run):

LLVM 3.5.2 (old JIT): 4m44s

LLVM 3.7.1 (MCJIT), no fast ISel: 7m31s

LLVM 3.7.1 (MCJIT), fast ISel: 7m39s

So not much of a difference unfortunately.

Lang Hames via llvm-dev

Feb 4, 2016, 9:46:06 PM
to Morten Brodersen, llvm-dev
Hi Morten,

> Here are the results (for a small but representative run):

That suggests an optimization quality issue rather than compile-time overhead. That's good news: I'd take it as a sign that the MC and linking overhead aren't a big deal either, and that if we can configure the CodeGen pipeline properly we can get the performance back to the level of the legacy JIT.

Cheers,
Lang.

Morten Brodersen via llvm-dev

Feb 4, 2016, 10:22:34 PM
to llvm-dev
Hi Lang,


> That suggests an optimization quality issue, rather than compile-time overhead

Yes, that makes sense. The long-running applications (6+ hours) JIT the rules once (taking a few seconds) and then run the generated machine code for hours, with no additional JIT'ing.


> if we can configure the CodeGen pipeline properly we can get the performance back to the same level as the legacy JIT.

Sounds great. Happy to help with whatever is needed.

Speaking of which:

We generate low-overhead profiling code as part of the generated IR. We use it to identify performance bottlenecks in the higher-level (pre-IR) optimization stages.

So I think it would be possible for me to identify a function that runs much slower in 3.7.1 than in 3.5.2, and extract the IR.

Would that help?

Cheers
Morten

Hal Finkel via llvm-dev

Feb 4, 2016, 10:26:30 PM
to Morten Brodersen, llvm-dev
----- Original Message -----
> From: "Morten Brodersen via llvm-dev" <llvm...@lists.llvm.org>
> To: "llvm-dev" <llvm...@lists.llvm.org>
> Sent: Thursday, February 4, 2016 9:21:58 PM
> Subject: Re: [llvm-dev] MCJIT Runtime Performance
>
> [snip]
>
> So I think it would be possible for me to identify a function that
> runs much slower in 3.7.1 than in 3.5.2, and extract the IR.
>
> Would that help?

It seems quite likely to help. Please do.

-Hal


Morten Brodersen via llvm-dev

Feb 4, 2016, 10:27:18 PM
to llvm-dev
Hi Hal,

We are using the default register allocator. I assume the greedy one is
the default?

As for other target machine optimizations:

I have tried:

    llvm::TargetMachine* tm = ...;
    tm->setOptLevel(llvm::CodeGenOpt::Aggressive);

And it doesn't make much of a difference.

And also:

    tm->setFastISel(true);

(previous email).

Is there anything else I can try?

Rafael Espíndola

Feb 4, 2016, 10:28:27 PM
to Morten Brodersen, llvm-dev
Can you build the code with llc? Try with the large code model. I
think that is the default for MCJIT and can be less efficient.
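A sketch of what that experiment might look like through EngineBuilder (3.7-era API; `modulePtr` is taken from the configuration snippets elsewhere in this thread, and how complete small-code-model support is in the JIT's runtime linker is something to verify):

```
// Sketch: ask MCJIT for the small code model instead of the default.
// Treat this as an experiment, not a fix.
llvm::EngineBuilder builder(std::move(modulePtr));
builder.setEngineKind(llvm::EngineKind::JIT);
builder.setOptLevel(llvm::CodeGenOpt::Aggressive);
builder.setCodeModel(llvm::CodeModel::Small);
llvm::ExecutionEngine *ee = builder.create();
```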

Cheers,
Rafael


Morten Brodersen via llvm-dev

Feb 4, 2016, 10:31:46 PM
to Hal Finkel, Morten Brodersen, llvm-dev
OK. I will ask the optimization guys to extract a good example from the
production code.

Hal Finkel via llvm-dev

Feb 4, 2016, 10:35:54 PM
to Morten Brodersen, llvm-dev
----- Original Message -----
> From: "Morten Brodersen via llvm-dev" <llvm...@lists.llvm.org>
> To: "llvm-dev" <llvm...@lists.llvm.org>
> Sent: Thursday, February 4, 2016 9:26:51 PM
> Subject: Re: [llvm-dev] MCJIT Runtime Performance
>
> [snip]
>
> Is there anything else I can try?

From your previous e-mail, it seems like this is a case of too little optimization, not too much, right?

Are you creating a TargetTransformInfo object for your target?

CodeGenPasses->add(
    createTargetTransformInfoWrapperPass(TM->getTargetIRAnalysis()));

I assume you're dominated by integer computation, not floating-point; is that correct?

-Hal


Morten Brodersen via llvm-dev

Feb 4, 2016, 10:40:25 PM
to llvm-dev
Hi Lang,


> MCJIT does not compile lazily (though it sounds like that's not an issue here?)

That is not an issue here, since the code JITs once (a few secs) and then runs the generated machine code for hours.


> Morten - Can you share any test cases that demonstrate the slowdown? I'd love to take a look at this.

The code is massive, so that's not practical. However, I will try to extract an example function that demonstrates the difference (as per my previous email).

Morten Brodersen via llvm-dev

Feb 4, 2016, 10:49:02 PM
to llvm-dev
Hi Rafael,

Not easily (llc).

Is there a way to make MCJIT not use the large code model when JIT'ing?

Cheers
Morten

Rafael Espíndola

Feb 4, 2016, 11:44:21 PM
to Morten Brodersen, llvm-dev, Davide Italiano
On 4 February 2016 at 22:48, Morten Brodersen via llvm-dev
<llvm...@lists.llvm.org> wrote:
> [snip]
> Is there a way to make MCJIT not use the large code model when JIT'ing?

I think Davide started adding support for the small code model.

Cheers,
Rafael

Keno Fischer via llvm-dev

Feb 4, 2016, 11:58:34 PM
to llvm-dev, Morten Brodersen, Davide Italiano
Actually, reading over all of this again, I realize I may have made the wrong statement. The runtime regressions we see in Julia are actually regressions in how long LLVM itself takes to do the compilation (but since it happens at run time in the JIT case, I think of it as a regression in our running time). We have only noticed occasional regressions in the performance of the generated code (which we are in the process of fixing). Which kind of regression are you talking about: time taken by LLVM, or time taken by the LLVM-generated code?

Morten Brodersen via llvm-dev

Feb 5, 2016, 12:13:17 AM
to llvm-dev
Hi Keno,

I am talking about runtime: the performance of the generated machine code, not the time it takes to lower the IR to machine code.

We typically JIT only once (taking a few secs) and then run the generated machine code for hours, so the JIT time (IR -> machine code) doesn't impact us.

Cheers
Morten

Jim Grosbach via llvm-dev

Feb 5, 2016, 1:54:36 AM
to Morten Brodersen, Arnaud Allard de Grandmaison via llvm-dev
I agree with Lang and Keno here. This is both unexpected and very interesting. Given the differences in defaults between the two, I would have expected the new JIT to have better performance but longer compile times. That you are seeing the opposite implies there is something very wrong and I'm very interested to help figure out what it is. 

Sent from my iPad

Larry Gritz via llvm-dev

Feb 5, 2016, 3:01:28 AM
to llvm...@lists.llvm.org
We have had the same experience with Open Shading Language (OSL). We found that MCJIT was significantly slower than old JIT, and partially for that reason we are (sorry) still using LLVM 3.4 in production.

We have basically two use cases: (1) offline queued batch processing on a computation farm, where a 50% hit in compilation time (seconds to minutes of CPU time) is not a big deal compared to the many hours of a full render, and where even a SLIGHT improvement in the runtime of the resulting JITed code makes up for it; and (2) interactive use in front of a human, where the JIT time is experienced as waiting around for something to happen (mostly at the beginning of the run, when they are antsy to see the first results show up on screen), and having that suddenly get 50% slower is a really big deal.

This is quite different from something like clang, where longer compilation time may annoy developers (or not; they like their coffee breaks) but would never be noticed by end users. Our users wait for the JIT every time they use the software.

I can see that MCJIT takes much longer than the old JIT, but I'm afraid I never profiled it or investigated specifically why this is the case. For unrelated reasons, my users have largely been unable to switch their toolchains to C++11 until now, so they were also stuck on LLVM 3.4, and figuring out what was up with MCJIT was not a high priority for me. But now that the switch to C++11 is afoot this year, unlocking more recent LLVM releases for me, MCJIT is on my radar again, so it's the perfect time for this topic to get revived.

-- lg


> On Feb 4, 2016, at 11:37 PM, Keno Fischer via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> [snip]

--
Larry Gritz
l...@larrygritz.com

Lang Hames via llvm-dev

Feb 5, 2016, 3:13:57 AM
to Jim Grosbach, Arnaud Allard de Grandmaison via llvm-dev, Morten Brodersen
Hi Morten,

Something else just occurred to me: can you share your EngineBuilder configuration lines? (http://llvm.org/docs/doxygen/html/classllvm_1_1EngineBuilder.html)

In particular - are you explicitly setting the optimization level? The old JIT may have had a different default.

- Lang.




Benoit Belley via llvm-dev

Feb 5, 2016, 9:34:40 AM
to Morten Brodersen, llvm-dev
Hi Morten,

We have experienced a similar slowdown in execution performance when upgrading to LLVM 3.7. The issue for us was that our front-end was emitting alloca instructions in non-entry basic blocks. After fixing the generation of LLVM IR in our front-end, we got similar or better performance with LLVM 3.7. See:


Maybe this is something that you can double-check.

Here’s a detailed explanation of the cause of the slowdown:

With LLVM 3.7, we have noticed that the MemCpy pass will attempt to copy an LLVM struct using moves that are as large as possible. For example, a struct of 3 floats is copied using a 64-bit and a 32-bit move. It is therefore important that such a struct be aligned on an 8-byte boundary, not just 4 bytes! Otherwise, one runs the risk of triggering store-forwarding pipeline stalls (which we did encounter, really badly, with one of our internal performance benchmarks). It is also important that the SROA pass correctly eliminates the loads/stores to the alloca memory regions.
Benoit


Benoit Belley
Sr Principal Developer
M&E-Product Development Group
Autodesk, Inc.

Tim Northover via llvm-dev

Feb 5, 2016, 9:48:42 AM
to Lang Hames, Arnaud Allard de Grandmaison via llvm-dev, Morten Brodersen
On 5 February 2016 at 00:13, Lang Hames via llvm-dev
<llvm...@lists.llvm.org> wrote:
> In particular - are you explicitly setting the optimization level? The old
> JIT may have had a different default.

Did we change what you had to do to set the CPU at some point, or am I
misremembering? If the old JIT automatically went for -march=native but
the new one doesn't, that could explain the slowdown.

Tim.

Morten Brodersen via llvm-dev

Feb 7, 2016, 6:58:58 PM
to Benoit Belley, llvm-dev
Thanks for this Benoit. I will investigate.

Cheers
Morten

Morten Brodersen via llvm-dev

Feb 7, 2016, 8:01:12 PM
to Lang Hames, Jim Grosbach, Arnaud Allard de Grandmaison via llvm-dev, Morten Brodersen
Hi Lang,


> can you share your EngineBuilder configuration lines?

Sure.

The 3.5.2 version uses:

        llvm::ExecutionEngine* ee =
            llvm::EngineBuilder(module)
                .setEngineKind(llvm::EngineKind::JIT)
                .setOptLevel(llvm::CodeGenOpt::Aggressive)
                .create();

        module->setDataLayout(ee->getTargetMachine()->getDataLayout());

And the 3.7.1 version uses:

        llvm::EngineBuilder builder(move(modulePtr));

        builder.setEngineKind(llvm::EngineKind::JIT);
        builder.setErrorStr(&error);
        builder.setOptLevel(llvm::CodeGenOpt::Aggressive);

        llvm::ExecutionEngine* ee = builder.create();

        module->setDataLayout(*ee->getTargetMachine()->getDataLayout());

Cheers
Morten

Paweł Bylica

Feb 8, 2016, 4:38:05 AM
to Morten Brodersen, Lang Hames, Jim Grosbach, Arnaud Allard de Grandmaison via llvm-dev
Hi all,

Can someone also explain how to configure MCJIT to generate code for the native target (like clang's -march=native)?

Thanks,
Paweł

Matt Godbolt via llvm-dev

Feb 8, 2016, 9:47:41 AM
to Paweł Bylica, Morten Brodersen, Lang Hames, Jim Grosbach, Arnaud Allard de Grandmaison via llvm-dev
I'm using:

    builder.setMCPU(llvm::sys::getHostCPUName());

which seems to do the trick.
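Going one step further, host CPU features can be plumbed in as well. This is a sketch against the 3.7-era API (`modulePtr` is assumed from the earlier configuration snippets, and whether MCJIT honors every feature string is worth verifying):

```
// Sketch: target the host CPU and its feature set, roughly -march=native.
llvm::EngineBuilder builder(std::move(modulePtr));
builder.setMCPU(llvm::sys::getHostCPUName());

// Collect the host's feature flags (+sse4.2, +avx, ...) if available.
llvm::StringMap<bool> HostFeatures;
std::vector<std::string> Attrs;
if (llvm::sys::getHostCPUFeatures(HostFeatures))
  for (const auto &F : HostFeatures)
    Attrs.push_back((F.second ? "+" : "-") + F.first().str());
builder.setMAttrs(Attrs);

llvm::ExecutionEngine *ee = builder.create();
```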

--matt

Lang Hames via llvm-dev

Feb 12, 2016, 1:32:28 PM
to Morten Brodersen, Arnaud Allard de Grandmaison via llvm-dev
Thanks Morten,

I'll check this out and confirm that these optimization options are being plumbed through as expected.

Were you able to produce any test cases that demonstrate the slowdown, in the end?

- Lang.

Morten Brodersen via llvm-dev

Feb 14, 2016, 6:50:04 PM
to Lang Hames, Morten Brodersen, Arnaud Allard de Grandmaison via llvm-dev
Thanks Lang.

We are working on it (test cases plus the ideas proposed by people on this list).

The production/optimization guys have to do this in-between customer work. And a single realistic run takes hours. So progress is steady but not fast.

I will report results on llvm-dev when we have them.

Cheers
Morten