Android Binary Size Increase


Brian Geffon

Apr 6, 2021, 1:29:27 PM
to binar...@chromium.org, Alexei Svitkine, George Burgess
Hello,

While trying to improve a hot UMA code path in
https://crrev.com/c/2802755, the binary size seems to have increased by
10KB. I wouldn't have expected such a large change in binary size,
since UNLIKELY just rearranges code so that the common path doesn't
require a taken branch. At most I would have expected a few bytes from
an additional jump back to the common path, and at least on x86_64 that
unconditional jmp instruction is a two-byte opcode. Given how
frequently this macro is embedded, perhaps it's adding up. Are there
any thoughts on this?
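
For reference, roughly the pattern in question (a minimal standalone sketch; the UNLIKELY definition mirrors base/compiler_specific.h as far as I remember, and the helpers are hypothetical):

#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

void HandleRareError();  // hypothetical helpers, declared for illustration
void DoHotPathWork();

void Process(bool rare_error) {
  if (UNLIKELY(rare_error)) {
    // Cold path: the hint encourages the compiler to move this out of line.
    HandleRareError();
    return;
  }
  // Hot path: expected to fall through without a taken branch.
  DoHotPathWork();
}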

Thanks!
Brian

Primiano Tucci

Apr 6, 2021, 3:17:09 PM
to Brian Geffon, binar...@chromium.org, Alexei Svitkine, George Burgess
"slowpath_down_into_the_epilogueI won't find it surprising. From my anecdotal experience UNLIKELY ( which becomes __builtin_expect or equivalent) boils down to:
1. Change the branch instruction for architectures where the dynamic predictor is simple/limited and  static prediction (e.g. assume untaken by default) still helps (maybe some ARM little cores unless you pass -mtune, not even sure about those)

2. Move all the unlikely code to the bottom of the function in the epilogue. This is really to improve cache hotness of the expectedly-taken side of the branch.

Number 2 is what, in my experience, inflates the binary size, because from the codegen viewpoint the change is:

without likely/unlikely:
  test register
  bz hot_path
  slow path instructions
hot_path:
  hot path instructions

with likely/unlikely:
  test register
  bz hot_path
  jump slowpath_down_into_the_epilogue
hot_path:
  hot path instructions
  ...
  ...
/* at the end of the function */
slowpath_down_into_the_epilogue:
  slow path instructions


So each unlikely branch has an extra indirection layer (the "jump slowpath_down_into_the_epilogue") plus the actual code. Furthermore, if the function body is large enough, on ARM ISAs the "jump slowpath_down_into_the_epilogue" might involve more than one instruction, because the ability to do a direct jump with an immediate within a single instruction is quite limited (I don't recall the details, but a branch instruction has only a small number of bits left for the immediate, which ARM tries to use in clever ways, recombining them as multipliers and shifters, but that doesn't cover all possible cases). So that jump can become a "load a register" + "branch to register".
Then, if for any reason you need to come back, you have the same boilerplate to jump back.


Primiano Tucci

Apr 6, 2021, 3:21:05 PM
to Brian Geffon, binar...@chromium.org, Alexei Svitkine, George Burgess
Going back to the original question ("Given how frequently this macro is embedded, perhaps it's adding up. Are there any thoughts on this?"), my main reaction would be:
how frequently are these UMA histograms really hit (in terms of being hit at runtime, not in terms of how often they are used in the codebase)?
My reasoning here would be:

1. If they are really that *hot* (say O(1k/second) or more), then the branch is only a tiny bit of what happens there. We should then look at and hyper-optimize the whole histogram bucket computation and anything else that happens when you hit a histogram.

2. If they are not that hot... does a branch hit/miss really matter either way?

George Burgess

Apr 6, 2021, 3:41:09 PM
to Primiano Tucci, Brian Geffon, binar...@chromium.org, Alexei Svitkine
i agree with primiano@ about the likely causes; if we wanna dig deeper, i'd build A/B, point bloaty at the results, and maybe look at what changed in the functions with the largest size changes. 

> 2. If they are  not that hot... does a branch hit/miss really matter either way?

i'd imagine "added I$ pressure because we're loading more cold code" also matters, though i can't accurately guesstimate how much.

it'd be interesting to see if the increase is present in TUs that have -Oz applied to them; in general, the compiler tries to keep code optimized with -Oz as small as possible. smells like a missed size optimization if branch metadata is making it generate larger code.

Daniel Cheng

Apr 6, 2021, 3:47:23 PM
to George Burgess, Primiano Tucci, Brian Geffon, binary-size, Alexei Svitkine
IIRC, the metrics docs recommend using the histogram functions over the histogram macros now.

If we followed this guidance more broadly (i.e. used the functions in non-hot paths), maybe the size increase would be negligible?
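
For example (a sketch; the histogram name is made up):

// Macro form: caches the histogram pointer at the call site.
UMA_HISTOGRAM_BOOLEAN("Hypothetical.Metric", success);

// Function form (base/metrics/histogram_functions.h): no per-call-site
// caching; the histogram is looked up by name on each call.
base::UmaHistogramBoolean("Hypothetical.Metric", success);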

Daniel

Brian Geffon

Apr 6, 2021, 3:47:41 PM
to Primiano Tucci, binar...@chromium.org, Alexei Svitkine, George Burgess
> So each unlikely branch has an extra indirection layer (the "jump slowpath_down_into_the_epilogue") plus the actual code. Furthermore, if the function body is large enough, on ARM ISAs the
> "jump slowpath_down_into_the_epilogue" might involve more than one instruction, because the ability to do a direct jump with an immediate within a single instruction is quite limited (I don't recall the details, but a branch instruction has only a small number of bits left for the immediate

That's a good point. I was thinking of x86_64, where a relative jmp
can fit in a two-byte opcode, but the more I think about it, the
places these macros are being embedded almost certainly wouldn't
allow that even on x86_64. So while we will be improving performance
in the hot path, it does seem it'll bloat the binary size. I have
another idea on how we can get the best of both worlds. I'll update
this thread in a bit.

Brian Geffon

Apr 6, 2021, 3:53:01 PM
to Daniel Cheng, George Burgess, Primiano Tucci, binary-size, Alexei Svitkine
Hi Daniel,
The issue is primarily around initialization: if the functions follow
the same pattern the macro uses (specifically, a static metric with an
initialization block), the hot path still takes a conditional branch,
and performance will be affected because we're forced to jump over
initialization code every time we use a metric. Do you know if this is
the situation?
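
Roughly the pattern I mean (a simplified sketch of the macro's cached-pointer expansion; the real macro uses atomics, and the histogram name is made up):

do {
  // Function-local static caches the histogram after the first call...
  static base::HistogramBase* histogram = nullptr;
  // ...but every call still pays this branch to skip the init block.
  if (UNLIKELY(!histogram)) {
    histogram = base::BooleanHistogram::FactoryGet(
        "Hypothetical.Metric",
        base::HistogramBase::kUmaTargetedHistogramFlag);
  }
  histogram->AddBoolean(sample);
} while (0);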

@gbiv, I wonder: if AFDO detects the hot paths, it should be
generating the same code, right? So this seems to show there were
places where AFDO wasn't optimizing the branch. Does that make
sense?

Brian

Daniel Cheng

Apr 6, 2021, 4:01:52 PM
to Brian Geffon, George Burgess, Primiano Tucci, binary-size, Alexei Svitkine
The functions don't have a static helper internally, so they always require a lookup. I previously looked into replacing the custom init in the macro helpers with C++ statics, and that generated significantly more code. In my mind, it'd be ideal to reduce the use of the macro to the hot paths, and then just rely on the compiler's code generation--which presumably optimizes for the 'init rare' case already.
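
I.e. the function path does something like this on every call (a sketch from memory, not the exact implementation):

void UmaHistogramBoolean(const std::string& name, bool sample) {
  // Name-based lookup on every call; no cached pointer per call site.
  base::BooleanHistogram::FactoryGet(
      name, base::HistogramBase::kUmaTargetedHistogramFlag)
      ->AddBoolean(sample);
}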

Daniel

George Burgess

Apr 6, 2021, 5:02:51 PM
to Daniel Cheng, Brian Geffon, Primiano Tucci, binary-size, Alexei Svitkine
> @gbiv, I wonder: if AFDO detects the hot paths, it should be
> generating the same code, right? So this seems to show there were
> places where AFDO wasn't optimizing the branch. Does that make
> sense?

this depends -- if these code paths aren't hot, they might very well not be represented in the AFDO profile (or they may have so few samples that they might as well not be in the profile at all ;) ).

for very hot branches, AFDO should nudge the compiler to lay them out as you detailed in your original post.

Alexei Svitkine

Apr 6, 2021, 5:07:38 PM
to George Burgess, Daniel Cheng, Brian Geffon, Primiano Tucci, binary-size
I think that, given each macro is a separate instance of the code that gets inlined at the call site, AFDO or similar likely won't be able to optimize all of them, even if it can help with some specific ones.

For the change in question, I think we should either revert it or make it conditional on the platform, so that Android doesn't incur a 10KB regression for an unclear benefit. Are there other thoughts?

Has anyone checked if the binary size is unchanged on non-Android platforms?

Brian Geffon

Apr 6, 2021, 5:14:43 PM
to Alexei Svitkine, George Burgess, Daniel Cheng, Primiano Tucci, binary-size
For context, the reason I started looking into this was that some
metrics were added in a hot path in Mojo, and it appears to have
become noticeable in profiles:
https://b.corp.google.com/issues/181588577

I'm not saying that this branch was the cause; it was just something I
noticed while looking into this.

Daniel Cheng

Apr 6, 2021, 5:21:58 PM
to Brian Geffon, Alexei Svitkine, George Burgess, Primiano Tucci, binary-size, Ken Rockot, Anand Mistry
Hmm, maybe we should consider taking those metrics out of Mojo? Or recording them more sporadically/sampling?
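
Something along these lines, perhaps (a sketch; the metric name and rate are made up, and base::RandInt itself isn't free, so a real version might prefer a cheaper per-thread counter):

// Record roughly 1 in 100 events to keep the average hot-path cost low.
if (base::RandInt(0, 99) == 0) {
  base::UmaHistogramTimes("Mojo.Hypothetical.Latency", elapsed);
}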

Brian Geffon

Apr 6, 2021, 5:22:01 PM
to Alexei Svitkine, George Burgess, Daniel Cheng, Primiano Tucci, binary-size
I can create a revert
(https://chromium-review.googlesource.com/c/chromium/src/+/2809013),
but is 10KB a problem? What is the official guideline on this?

Brian Geffon

Apr 6, 2021, 7:11:18 PM
to Daniel Cheng, Alexei Svitkine, George Burgess, Primiano Tucci, binary-size, Ken Rockot, Anand Mistry
FWIW, I was curious about Android binary size without LIKELY/UNLIKELY
(https://chromium-review.googlesource.com/c/chromium/src/+/2807825);
it seems that removing them drops the Android binary size by 34KB.

Does anyone have guidance here? The CL I mailed seems to increase the
binary size by 10KB; is that a problem? We know that avoiding the
branch will have a performance benefit; however, I'm not clear how I
can go about quantifying that to justify the 10KB. Is there any
guidance on this?

Brian Geffon

Apr 6, 2021, 7:24:11 PM
to Daniel Cheng, Alexei Svitkine, George Burgess, Primiano Tucci, binary-size, Ken Rockot, Anand Mistry
Yes, I agree we should also do what Daniel is suggesting; I can mail a CL for that.

Daniel Bratell

Apr 7, 2021, 2:34:05 AM
to Brian Geffon, Alexei Svitkine, George Burgess, Daniel Cheng, Primiano Tucci, binary-size
Isn't the official guideline that the binary size increase has to be
worth it? I.e., you always have to weigh benefit against cost, and no
fixed number will ever be correct. There is an enforced rule that
changes above a certain size have to be explained in the commit
message, but any increase could be OK if the value is high enough.

For this one, my very personal opinion is that 10KB is on the high side
for a change like this, but it all depends on how much time it will
save. Now it looks like you have a better plan, so if that one works
out, reverting this to keep the code simpler and smaller wouldn't be
wrong, would it?

/Daniel