I wouldn't find it surprising. From my anecdotal experience, UNLIKELY (which becomes __builtin_expect or equivalent) boils down to:
1. Change the branch instruction for architectures where the dynamic predictor is simple/limited and static prediction (e.g. assume untaken by default) still helps (maybe some ARM little cores unless you pass -mtune, not even sure about those)
2. Move all the unlikely code to the bottom of the function, down by the epilogue. This is really to keep the expected side of the branch hot in the instruction cache.
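For reference, this is roughly what such a macro looks like at the source level. It's only a minimal sketch: the names LIKELY/UNLIKELY and the sum_bytes function are made up for illustration, and whether the compiler actually moves the annotated block depends on the target and the optimization level.

```c
#include <stddef.h>

/* Hypothetical macro names; __builtin_expect itself is provided by GCC and Clang. */
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x) __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x) (x)
#define UNLIKELY(x) (x)
#endif

/* Illustrative function: sums a buffer, returns -1 for a NULL buffer (the rare case). */
long sum_bytes(const unsigned char *buf, size_t len)
{
    if (UNLIKELY(buf == NULL)) {
        /* Slow path: the compiler may place this block away from the hot path. */
        return -1;
    }

    /* Hot path: expected to run on almost every call. */
    long total = 0;
    for (size_t i = 0; i < len; i++)
        total += buf[i];
    return total;
}
```

The hint doesn't change what the function returns, only (possibly) how the generated code is laid out.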
Point 2 is what, in my experience, inflates the binary size, because the change from the codegen viewpoint is:
Without likely/unlikely:
test register
bz hot_path
slow path instructions
hot_path:
hotpath instructions
With likely/unlikely:
test register
bz hot_path
jump slowpath_down_into_the_epilogue
hot_path:
hotpath instructions
...
...
/*At the end of the function*/
slowpath_down_into_the_epilogue:
slow path instructions
So each unlikely branch has an extra layer of indirection (the jump slowpath_down_into_the_epilogue) on top of the actual code. Furthermore, if the function body is large enough, on ARM ISAs the "jump slowpath_down_into_the_epilogue" might involve more than one instruction, because the ability to do direct jumps with an immediate within one instruction is quite limited (I don't recall the details, but a branch instruction has only a limited number of bits left for the immediate, which ARM tries to use in clever ways, letting them be recombined as multipliers and shifters, but that doesn't cover all possible cases). So that jump becomes a "load a register" + "branch to register".
Then, if for any reason you need to come back to the hot path, you pay the same boilerplate again to jump back.
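To make the "come back" case concrete, here is a small sketch of a source pattern where the slow path rejoins the hot path. The struct, the function name and the UNLIKELY macro are all made up for illustration, and the layout described in the comments is only what a compiler may do, not something guaranteed.

```c
#include <stddef.h>

/* Hypothetical macro name wrapping the GCC/Clang builtin. */
#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

/* Illustrative accumulator: running total plus a count of clamped samples. */
struct stats {
    long total;
    unsigned long clamped;
};

void accumulate(struct stats *st, const int *samples, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int v = samples[i];
        if (UNLIKELY(v < 0)) {
            /* Slow path: if the compiler lays this out at the end of the
               function, it needs one jump out of the loop body and a
               second jump back in to rejoin the code below. */
            st->clamped++;
            v = 0;
        }
        /* Hot path: the slow path rejoins here. */
        st->total += v;
    }
}
```

Comparing the output of a build with and without the macro (for example at -O2) is the easiest way to see whether the extra jumps actually show up for a given target.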