Hi Daniel,
Your observations seem valid to me. Some high-level comments from my side.
As you said, the loops are quite similar. We have also observed that, in general, we generate more code around loops, in the function prologue and epilogue,
where some data and arguments get moved and reshuffled. While this is very obvious in these micro-benchmarks, it hasn't bothered us enough yet for larger apps, where this is less important (or where other things are more important). The
outlier does indeed look to be Clang -Oz for memcpy_alt2, which is perhaps a "code-size bug". As I haven't looked into it, it's too early for me to blame this on the addressing modes alone, as there could be several things going on.
Since this is a micro-benchmark, and lowering memcpy is a bit of an art ;-), for which a specialised implementation is probably available, you might also want
to look at some other code that is important to you.
Your remarks about execution times might be right too and, as you said, are probably best confirmed with benchmark numbers. In our
group, we have not really looked into performance for the Cortex-M0, probably because it's the only v6m core (although the Cortex-M23 and Armv8-M Baseline are very similar) and because code size is more important for us, but there might be something to be gained
here.
Cheers,
Sjoerd.