[llvm-dev] (RFC) JumpMaps: switch statement optimization


Witold Waligora via llvm-dev

Feb 13, 2017, 8:11:33 AM
to llvm...@lists.llvm.org
Hi All,

This is our first appearance on llvm-dev so a small introduction is in
order.
We are a team of two, working for a good few years now on an out-of-tree
embedded back-end. Our primary goal is code size; performance is secondary.
We've recently implemented an optimization that we believe others could
use. It's about time to contribute back.

-------------------------------------------------------
JumpMaps: a generalization of JumpTables
JumpTables produce fast and small code but have a limitation - they only
work if case values are easy to compute from the variable being
switch()ed. To overcome this limitation we introduce a {key,value}
structure - a JumpMap.

Simplifying somewhat, when it cannot produce a JumpTable, LLVM currently
generates a sequential if-else-if chain for small switches and an inline
binary search for large ones. JumpMaps would instead generate a call
followed by a jump through a register.
Pseudo-code example:

switch(x) {
case 0x6851: {BB1}
case 0x1383: {BB2}
case 0x2224: {BB3}
default: {defaultBB}
}

Without Jump Maps:
if(x==0x6851) {
goto BB1
}
else if(x==0x1383) {
goto BB2
}
else if(x==0x2224){
goto BB3
}
else{
goto defaultBB
}

With Jump Maps:
jumpmap_0 = {
keys = {3, 0x6851, 0x1383, 0x2224}
vals = {defaultBB, BB1, BB2, BB3}
}
addr dst = __jumpmap_find(&jumpmap_0, x)
goto dst;
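To make the pseudo-code concrete, here is a minimal C sketch of what the map structure and runtime helper might look like. The struct layout and field names are illustrative assumptions, not the actual out-of-tree implementation (which would more likely be hand-written assembly):

```c
#include <stdint.h>

/* Hypothetical in-memory layout: a key count plus parallel arrays.
 * vals[0] holds the default destination; vals[i+1] pairs with keys[i]. */
struct jumpmap32 {
    uint32_t        count; /* number of keys */
    const uint32_t *keys;  /* case values, count entries */
    const void    **vals;  /* count + 1 entries */
};

/* Return the branch target for x, or the default destination if x
 * matches no key - so no separate range-check block is needed. */
const void *__jumpmap_find_i32(const struct jumpmap32 *m, uint32_t x)
{
    for (uint32_t i = 0; i < m->count; i++)
        if (m->keys[i] == x)
            return m->vals[i + 1];
    return m->vals[0];
}
```

The switch then lowers to a single call plus an indirect branch, e.g. `goto *__jumpmap_find_i32(&jumpmap_0, x);` in GNU C terms.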

On our target, JumpMaps produce both smaller and faster code even for
quite small switch statements. We believe other architectures would
benefit as well, X86 and ARM included:
- the jumpmap structure has a good chance of being cached more
efficiently than a large chunk of inline binary-search code
- for large switches, branch prediction should have an easier time
compared to an inline binary search
- embedded architectures (ARM, MSP430) would benefit from smaller code
(trading off .text for .rodata)

In terms of lowering, the JumpMap flow is very similar to that of
JumpTables. The lowering is even slightly easier, as JumpMaps do not
require an extra range-check basic block - the default case is also
handled by __jumpmap_find.
Targets need to provide lowering for the map structure and the
__jumpmap_find function, which would typically become part of the
compiler-rt library and could be written in hand-crafted assembly for
extra speed.


Questions or comments?
Please let me know.


Best Regards,
Witold Waligóra
Myre Laboratories
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Friedman, Eli via llvm-dev

Feb 13, 2017, 7:24:02 PM
to Witold Waligora, llvm...@lists.llvm.org

Smaller codesize for switches could be interesting for Thumb targets.
Some thoughts.

1. For Thumb, the overall size is more important than constants vs.
code; the tables will end up in the text segment anyway.

2. How many versions of __jumpmap_find do you need? If you're going for
size, you probably want to put 1 or 2 byte relative jump offsets into
the table, if possible. You'd also want to shrink the keys if possible:
for a small range, we could use 2-byte, or even 1-byte keys. You could
also encode deltas into the table, rather than absolute values; the
deltas are likely to be smaller than the keys, and you're iterating over
the table anyway.

3. Is the linear search a problem? I mean, the linear nature of it
doesn't matter for small tables, but eventually you get to a point where
it just doesn't make sense. Linearly iterating over 1000 entries in a
table is clearly going to be slower than a binary search. Or is the
idea that you'd do a binary search down to say, 20 entries, then call
__jumpmap_find?

4. How often does this pattern come up, in your experience?

-Eli

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project

Joerg Sonnenberger via llvm-dev

Feb 13, 2017, 7:40:22 PM
to llvm...@lists.llvm.org
On Mon, Feb 13, 2017 at 02:11:24PM +0100, Witold Waligora via llvm-dev wrote:
> JumpMaps: a generalization of JumpTables
> JumpTables produce fast and small code but have a limitation - they only
> work if case values are easy to compute from the variable being
> switch()ed. To overcome this limitation we introduce a {key,value}
> structure - a JumpMap.

We've been discussing supporting either perfect hashing or just a
pre-transformation a while ago. That would have a very similar use case,
the primary issue is the additional code for checking that the hashed
value is actually the expected value.

Joerg

Nema, Ashutosh via llvm-dev

Feb 14, 2017, 4:53:59 AM
to Witold Waligora, llvm-dev
Hi Witold,

Can you explain what you mean by adding support in compiler-rt and the changes in lowering?

Did you try any benchmarks? Please share if you have any numbers on performance improvements and code size reduction.

Regards,
Ashutosh

Witold Waligora via llvm-dev

Feb 14, 2017, 8:16:58 AM
to Friedman, Eli, llvm...@lists.llvm.org
Eli,

> 2. How many versions of __jumpmap_find do you need?

In our out-of-tree implementation we currently encode case values as-is,
thus we need four versions of the __jumpmap_find function (i8, i16, i32, i64).
We have an analysis pass that helps us figure out which kinds of maps
are actually profitable.
For example, if we discover that we only have one i8 jumpmap but many
i16 jumpmaps, then we would zero-extend the i8 map to i16 so that we
don't use __jumpmap_find_i8 at all and it can be removed.

> If you're going for
> size, you probably want to put 1 or 2 byte relative jump offsets into
> the table, if possible. You'd also want to shrink the keys if possible:
> for a small range, we could use 2-byte, or even 1-byte keys. You could
> also encode deltas into the table, rather than absolute values; the
> deltas are likely to be smaller than the keys, and you're iterating over
> the table anyway.

Yes, many options are possible. In general, as long as the __jumpmap_find()
function can undo whatever the compiler did to the map, you're fine. We
plan to keep this optimization flexible when upstreaming.
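For illustration, the byte-shrinking / delta-encoding suggestion could look roughly like this. Everything here is an assumption for the sketch - the 8-bit key deltas, 16-bit destination offsets, and the base-address field are not the actual encoding:

```c
#include <stdint.h>

/* Hypothetical compact encoding: keys sorted ascending, stored as 8-bit
 * deltas from the previous key; destinations stored as 16-bit byte
 * offsets from a per-function base address. */
struct jumpmap_compact {
    uint8_t         count;
    const uint8_t  *key_deltas;  /* count entries */
    const uint16_t *dst_offsets; /* count + 1 entries; [0] = default */
    const char     *base;        /* base address the offsets refer to */
};

const void *__jumpmap_find_compact(const struct jumpmap_compact *m,
                                   uint32_t x)
{
    uint32_t key = 0;
    for (uint8_t i = 0; i < m->count; i++) {
        key += m->key_deltas[i];  /* reconstruct the absolute key */
        if (key == x)
            return m->base + m->dst_offsets[i + 1];
        if (key > x)              /* keys are sorted: early out */
            break;
    }
    return m->base + m->dst_offsets[0];
}
```

With four 32-bit keys this shrinks the key storage from 16 bytes to 4, at the cost of reconstructing keys while iterating - which, as Eli notes, is work the linear scan is doing anyway.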

> 3. Is the linear search a problem? I mean, the linear nature of it
> doesn't matter for small tables, but eventually you get to a point where
> it just doesn't make sense. Linearly iterating over 1000 entries in a
> table is clearly going to be slower than a binary search. Or is the
> idea that you'd do a binary search down to say, 20 entries, then call
> __jumpmap_find?

Our __jumpmap_find does a binary search internally, and it turned out to
be both faster and smaller than what the compiler would generate
otherwise (bit-test clustering).
There is some concern about how well jumpmaps could integrate with the
existing switch clustering code on a pure-speed target.
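A sketch of the binary-search variant, assuming the compiler emits the keys sorted (the signature and names are illustrative, not the actual helper):

```c
#include <stdint.h>

/* Binary search over sorted keys; vals[0] is the default destination,
 * vals[i + 1] corresponds to keys[i]. */
const void *__jumpmap_find_sorted(const uint32_t *keys,
                                  const void **vals,
                                  uint32_t count, uint32_t x)
{
    uint32_t lo = 0, hi = count;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        if (keys[mid] < x)
            lo = mid + 1;
        else if (keys[mid] > x)
            hi = mid;
        else
            return vals[mid + 1];
    }
    return vals[0];  /* no match: take the default edge */
}
```

One shared copy of this loop replaces every inline binary-search tree the compiler would otherwise emit, which is where the code-size win comes from.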

> 4. How often does this pattern come up, in your experience?

This depends heavily on how large a switch has to be before folding it
into a jumpmap becomes profitable on a particular target.
For us the break-even point is at 5 cases, which is very small and fires
quite often.
I can't speak for X86/speed, but as a code-size optimization this fires
on any switch statement with >= 5 cases that fails to fold into a JumpTable.


Witold

Witold Waligora via llvm-dev

Feb 14, 2017, 8:28:12 AM
to Nema, Ashutosh, llvm-dev
JumpMap lowering is nearly identical to that of JumpTables, with the
exception of the missing range-check basic block.
We introduce a JumpMapInfo structure which follows the same flow as
JumpTableInfo and is finally emitted by the AsmPrinter.

There are many ways a Target may want to encode jumpmaps (deltas,
compression, relative vs absolute), so we plan to keep this flexible and
target-driven when upstreaming.

Witold

On 2017-02-14 at 10:53, Nema, Ashutosh wrote:

Witold Waligora via llvm-dev

Feb 14, 2017, 8:47:42 AM
to llvm...@lists.llvm.org
I didn't answer the compiler-rt and benchmarks questions.

compiler-rt would have to implement the decoding function(s) to undo
whatever the compiler did. This would be target-dependent.
Our target implements four of them, one for each basic type:
jumpmap_find_i8, jumpmap_find_i16, jumpmap_find_i32, jumpmap_find_i64.

We don't have any benchmarks for any of the in-tree targets yet.

Witold


On 2017-02-14 at 14:28, Witold Waligora via llvm-dev wrote:

Hans Wennborg via llvm-dev

Feb 14, 2017, 1:39:21 PM
to Witold Waligora, llvm-dev
I wonder if it would make sense to emit the jumpmap_find_ functions in
IR rather than in compiler-rt.

Many targets don't link against compiler-rt, e.g. x86 Linux typically
uses libgcc.

If they're emitted in IR, the functions could potentially be inlined.
For example if the size of the switch is known to be small so no
binary search is done, could inlining the find_ function be smaller
than setting up the call?

James Molloy via llvm-dev

Feb 14, 2017, 3:55:49 PM
to Hans Wennborg, Witold Waligora, llvm-dev
Hi,

Thanks for suggesting this! It sounds like a great optimization.

Hans makes a good point that in some cases inlining the map function could be very cheap; especially with a very well written linear search loop. I found this precisely when implementing a similar feature - compressed jump tables for Thumb-1.

Something to bear in mind when making this flexible in terms of the types of search functions it can emit is that it's possible, as you alluded earlier, to overoptimize and end up linking several different search algorithms and wasting space. This is even a problem with LTO as we do switch lowering function-at-a-time in isolation.

Sounds like a great idea though; I look forward to seeing it in tree!

James

Nema, Ashutosh via llvm-dev

Feb 15, 2017, 1:12:17 AM
to Hans Wennborg, Witold Waligora, llvm-dev
I agree with Hans's points: not all targets link against compiler-rt, so if these functions are generated in IR the compiler-rt dependency goes away. Keeping these functions in IR may also yield further performance gains through inlining.

Looking forward to trying this optimization.

Marta Wszola via llvm-dev

Mar 31, 2017, 4:23:02 AM
to llvm...@lists.llvm.org, ha...@chromium.org, Witold Waligora, Ashuto...@amd.com
We have the prototype version working on X86. Our prototype:
a) analyses switch statements conditions' and keys' byte sizes,
b) finds jump maps (of course, jump tables (also partitioned) have
priority),
c) creates appropriate function calls,
d) creates simple jump map structure in memory.
We've prepared a few tests in C to verify correctness.

We've got a few questions regarding implementation:
1) Right now, the data structure that we emit uses position-independent
addresses. Of course, it's not an optimal solution. Should we implement
relative addresses and support for the various relocations in our patch,
or should we leave that to the X86 target developers?
2) What is the best way to insert search functions? At which step should
we do it? We were thinking about creating all functions early in IR and
deleting the unused ones after CodeGen (our prototype does not create
functions, it "assumes" that they are linked). We can alternatively
write a library to be linked later.
3) We've run into problems with the analysis step. Since it needs to
know the minimal sizes of jump tables and jump maps (and it will
probably grow even more target hooks in the future), it's a TargetMachine
pass required by ISel. However, we cannot register a TargetMachine
module pass: doing so makes the PassManager either throw an assertion
stating that the pass cannot be scheduled or enter an infinite loop. We
didn't have this problem with an older LLVM version. For now we've
turned it into a function pass, but that's not what we really want. How
(and where) should we register a TargetMachine module analysis pass?
4) What is the best way to test it? We need to test not only the call
creation (which should be easy with FileCheck) but also the data
structure and the functions being called.

Finally, should we submit the code we have so far to Phabricator for
early review now? As you can see, the implementation is not complete.

Best Regards,
Marta Wszoła
Myre Laboratories

Hans Wennborg via llvm-dev

Mar 31, 2017, 5:04:58 AM
to Marta Wszola, llvm-dev

Yes, I think uploading the patch would be useful to help the discussion.

Thanks,
Hans

Evgeny Astigeevich via llvm-dev

Mar 31, 2017, 7:13:02 AM
to Marta Wszola, llvm-dev, nd
Hi Marta,

My team is working on code size improvements on ARM.
The optimization looks interesting.
We can help with porting to ARM, benchmarking and reviewing patches.

Evgeny Astigeevich
Senior Compiler Engineer
Compilation Tools
ARM

Marta Wszola via llvm-dev

May 2, 2017, 8:04:24 AM
to llvm...@lists.llvm.org
Initial code for early review has been submitted to Phabricator:
https://reviews.llvm.org/D32742

There is still an open question on map search functions.

Best regards,
Marta Wszoła
Myre Laboratories
