code size of different architectures (was: Has stack ...)

Anton Ertl

Aug 9, 2017, 9:43:42 AM

lars.br...@gmail.com writes:
>It would be interesting to remake Anton's code size survey with the
>current crop of instruction sets and compiler technologies.

Debian currently has different sets of architectures for the different
releases, and different versions of programs for the releases, so I
produced two sets of data, as follows:

# Older set: bash 4.2, grep 2.12, gzip 1.5 (data.tar.gz)
for i in amd64 armel armhf i386 ia64 mips mipsel powerpc s390 s390x sparc; do
  wget http://ftp.at.debian.org/debian/pool/main/b/bash/bash_4.2+dfsg-0.1+deb7u3_$i.deb
  wget http://ftp.at.debian.org/debian/pool/main/g/grep/grep_2.12-2_$i.deb
  wget http://ftp.at.debian.org/debian/pool/main/g/gzip/gzip_1.5-1.1_$i.deb
done
for i in amd64 armel armhf i386 ia64 mips mipsel powerpc s390 s390x sparc; do
  ar x bash_4.2+dfsg-0.1+deb7u3_$i.deb; tar xfz data.tar.gz ./bin/bash; objdump -h bin/bash|awk --non-decimal-data '/[.]text/ {printf("%8d ","0x"$3)}'
  ar x grep_2.12-2_$i.deb; tar xfz data.tar.gz ./bin/grep; objdump -h bin/grep|awk --non-decimal-data '/[.]text/ {printf("%8d ","0x"$3)}'
  ar x gzip_1.5-1.1_$i.deb; tar xfz data.tar.gz ./bin/gzip; objdump -h bin/gzip|awk --non-decimal-data '/[.]text/ {printf("%8d ","0x"$3)}'
  echo $i
done|sort -nk1

# Newer set: bash 4.4, grep 2.20, gzip 1.6 (data.tar.xz)
for i in amd64 arm64 armel armhf i386 mips mips64el mipsel powerpc ppc64el s390x; do
  wget http://ftp.at.debian.org/debian/pool/main/b/bash/bash_4.4-5_$i.deb
  wget http://ftp.at.debian.org/debian/pool/main/g/grep/grep_2.20-4.1_$i.deb
  wget http://ftp.at.debian.org/debian/pool/main/g/gzip/gzip_1.6-4_$i.deb
done
for i in amd64 arm64 armel armhf i386 mips mips64el mipsel powerpc ppc64el s390x; do
  ar x bash_4.4-5_$i.deb; tar xfJ data.tar.xz ./bin/bash; objdump -h bin/bash|awk --non-decimal-data '/[.]text/ {printf("%8d ","0x"$3)}'
  ar x grep_2.20-4.1_$i.deb; tar xfJ data.tar.xz ./bin/grep; objdump -h bin/grep|awk --non-decimal-data '/[.]text/ {printf("%8d ","0x"$3)}'
  ar x gzip_1.6-4_$i.deb; tar xfJ data.tar.xz ./bin/gzip; objdump -h bin/gzip|awk --non-decimal-data '/[.]text/ {printf("%8d ","0x"$3)}'
  echo $i
done|sort -nk1

To reproduce, you should replace ftp.at.debian.org with your local
mirror.

Note that there is no mips64el package for grep and gzip, so the
outputs you see for these packages are nonsense (deleted below).

Here's the output for one set of architectures and package versions,
sorted by bash code size:

    bash     grep     gzip
  398384    88084    47944 armhf
  584340   130872    68276 armel
  588972   129096    66892 amd64
  604656   131804    66268 i386
  637620   133868    72712 s390
  638912   140544    71744 sparc
  674912   141120    74032 mipsel
  674912   141168    74112 mips
  680928   139664    74272 powerpc
  688052   150680    75908 s390x
 1539872   322432   158656 ia64

And here's the (edited) output for the other set:

    bash     grep     gzip
  510144   105100    46992 armhf
  697794   148122    62506 amd64
  712580   134116    59452 arm64
  754436   152948    61724 i386
  787780   157384    63268 armel
  841024   171824    75536 powerpc
  854272   177088    72600 s390x
  899984                   mips64el
  916848   182400    77072 mipsel
  917024   182432    77104 mips
  971684   186296    84912 ppc64el

These are the CONTENT field of the .text section of the binaries. I
don't know what the CONTENT field means, so it may be the wrong size.
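
For what it's worth, in `objdump -h` output the third whitespace-separated field of a section row is the Size column (printed in hex), which is what `$3` in the awk one-liners picks up; CONTENTS is one of the flag names printed on the line after it. A minimal Python sketch of the same extraction follows. The function name `text_size` and the sample listing (including its addresses) are made up for illustration; only the size 0008fcac corresponds to a figure in this post, namely 588972 for amd64 bash in the first set.

```python
def text_size(objdump_h_output: str) -> int:
    """Return the size in bytes of the .text section from `objdump -h` output."""
    for line in objdump_h_output.splitlines():
        fields = line.split()
        # Section rows look like: Idx Name Size VMA LMA File-off Algn
        if len(fields) >= 3 and fields[1] == ".text":
            return int(fields[2], 16)  # the Size column is hexadecimal
    raise ValueError("no .text section found")


# Illustrative listing: the addresses are invented, only the Size is real.
SAMPLE = """\
Idx Name          Size      VMA               LMA               File off  Algn
 12 .plt          00000010  000000000041a000  000000000041a000  0001a000  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
 13 .text         0008fcac  000000000041a010  000000000041a010  0001a010  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
"""

print(text_size(SAMPLE))  # 588972
```

Unlike the awk pattern `/[.]text/`, this sketch matches the section name exactly, so a section such as `.text.unlikely` would not be picked up.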

32-bit: armhf armel i386 s390 sparc mipsel mips powerpc
64-bit: arm64 amd64 ia64 s390x mips64el ppc64el

The difference between armhf and armel on these integer-only codes is
surprising. In the older set, mips and mipsel were smaller than
powerpc and s390x, in the newer they are larger. I take this as an
indication that compiler direction (possibly influenced by
microarchitecture, such as favouring code alignment to bigger
boundaries) may have a larger influence on code size than the
instruction set. Overall, apart from armhf and ia64, the
architectures are remarkably similar in code size, especially in the
first set.
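
One way to put a number on "remarkably similar" is the ratio of the largest to the smallest .text size per program. The sketch below does that for the first set; the figures are copied from the table in this post, while the helper name `spread` and the choice of a max/min ratio as the measure are assumptions made for illustration.

```python
# .text sizes in bytes from the first (older) set: (bash, grep, gzip)
OLD_SET = {
    "armhf":   (398384, 88084, 47944),
    "armel":   (584340, 130872, 68276),
    "amd64":   (588972, 129096, 66892),
    "i386":    (604656, 131804, 66268),
    "s390":    (637620, 133868, 72712),
    "sparc":   (638912, 140544, 71744),
    "mipsel":  (674912, 141120, 74032),
    "mips":    (674912, 141168, 74112),
    "powerpc": (680928, 139664, 74272),
    "s390x":   (688052, 150680, 75908),
    "ia64":    (1539872, 322432, 158656),
}
PROGRAMS = ("bash", "grep", "gzip")


def spread(sizes, exclude=()):
    """Largest/smallest size ratio per program, skipping excluded architectures."""
    kept = [v for arch, v in sizes.items() if arch not in exclude]
    return {
        prog: max(v[i] for v in kept) / min(v[i] for v in kept)
        for i, prog in enumerate(PROGRAMS)
    }


for prog, ratio in spread(OLD_SET, exclude=("armhf", "ia64")).items():
    print(f"{prog}: largest/smallest = {ratio:.2f}")
```

With armhf and ia64 excluded, each ratio comes out below 1.2; with them included, each is above 3.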

Followups set to comp.arch.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2017: http://euro.theforth.net/

MitchAlsup

Aug 9, 2017, 11:03:40 AM

On Wednesday, August 9, 2017 at 8:43:42 AM UTC-5, Anton Ertl wrote:
> sorted by bash code size:
>
> bash grep gzip

> 398384 88084 47944 armhf // throw this away

> 584340 130872 68276 armel
> 588972 129096 66892 amd64
> 604656 131804 66268 i386
> 637620 133868 72712 s390
> 638912 140544 71744 sparc
> 674912 141120 74032 mipsel
> 674912 141168 74112 mips
> 680928 139664 74272 powerpc
> 688052 150680 75908 s390x

> 1539872 322432 158656 ia64 // throw this away

and you have a powerful argument that instruction sets don't matter.

Terje Mathisen

Aug 9, 2017, 11:13:36 AM

Anton Ertl wrote:
> bash grep gzip
> 510144 105100 46992 armhf
> 697794 148122 62506 amd64
> 712580 134116 59452 arm64
> 754436 152948 61724 i386

[snip]

> The difference between armhf and armel on these integer-only codes is
> surprising. In the older set, mips and mipsel were smaller than
> powerpc and s390x, in the newer they are larger. I take this as an
> indication that compiler direction (possibly influenced by
> microarchitecture, such as favouring code alignment to bigger
> boundaries) may have a larger influence on code size than the
> instruction set. Overall, apart from armhf and ia64, the
> architectures are remarkably similar in code size, especially in the
> first set.
>
> Followups set to comp.arch.
>

I agree, the size differences are mostly down in the noise at this point.

I'm somewhat surprised that amd64 is consistently smaller than i386; it
would seem this has to be mostly due to calling conventions and/or
optimization levels?
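
A quick check of the amd64 and i386 rows quoted above, as a small Python sketch (the helper name `relative_sizes` is made up for illustration). On these figures amd64 is about 7.5% smaller for bash and about 3% smaller for grep, while gzip is about 1% larger on amd64.

```python
# .text sizes in bytes from the quoted rows: (bash, grep, gzip)
AMD64 = (697794, 148122, 62506)
I386 = (754436, 152948, 61724)


def relative_sizes(a, b):
    """Size of each program in `a` relative to `b`, as a ratio."""
    return tuple(x / y for x, y in zip(a, b))


for prog, ratio in zip(("bash", "grep", "gzip"), relative_sizes(AMD64, I386)):
    print(f"{prog}: amd64/i386 = {ratio:.3f}")
```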

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Quadibloc

Aug 9, 2017, 11:18:52 AM

On Wednesday, August 9, 2017 at 9:03:40 AM UTC-6, MitchAlsup wrote:

> and you have a powerful argument that instruction sets don't matter.

Except when they do - at the two extreme ends of the range, ARM's Thumb and Itanium.

It isn't actually all that surprising that the general-purpose instruction sets are similar, although indeed it _is_ a useful finding that x86 and 360 aren't shorter than 32-bit RISCs, so just going to variable-length CISC doesn't yield a worthwhile benefit.

John Savard

MitchAlsup

Aug 9, 2017, 11:40:43 AM

On Wednesday, August 9, 2017 at 10:18:52 AM UTC-5, Quadibloc wrote:

> so just going to variable-length CISC doesn't yield a worthwhile benefit.

Take this to heart.

Quadibloc

Aug 9, 2017, 11:54:47 AM

Ah, while I did notice its relevance to my designs, it now takes away an excuse for variable-length instructions, also needed for your immediates.

timca...@aol.com

Aug 9, 2017, 3:11:03 PM

On Wednesday, August 9, 2017 at 11:13:36 AM UTC-4, Terje Mathisen wrote:

> I'm somewhat surprised that amd64 is consistently smaller than i386, it
> would seem this has to be mostly due to calling conventions and/or
> optimization levels?
>

My guess would be fewer instructions because of less thrashing of values in registers. After all:

BGB

Aug 9, 2017, 3:32:29 PM

yeah, it seems to be mostly that many ISAs tend towards a local optimum
WRT the usage of bits:
shorter-width ISAs end up needing more instructions for the same work;
longer-width ISAs tend towards doing more work per instruction;
...

x86 can do an OK amount of work per instruction, but typically needs
more bytes due to prefixes and sub-optimal use of coding space (much of
the single-byte range being used by now rarely-used byte-oriented
instructions).

x86-64 can probably be explained partly because having more registers
allows it to lean towards needing fewer instructions (and fewer memory
accesses), mostly offsetting the added cost of the REX prefix.

so, things mostly balance out.

I suspect it is similar in concept to the Shannon Limit:
* https://en.wikipedia.org/wiki/Noisy-channel_coding_theorem

(though, typically, one can gain an additional ~30% or so via Deflate or
LZ4 compression or similar, 1; with some variation due to ISA).

1: I actually got pretty comparable results (for compressing executable
code) when using a compressor which worked entirely in terms of 32-bit
DWORDs (it matched sequences of DWORDs rather than Bytes; and the
encoded data was itself a series of DWORDs). as can be noted, it was
marginally more effective with ARM and SuperH code than with x86, but
both were fairly comparable to the LZ4 results (which in-turn were
fairly similar to the Deflate results).
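
A rough way to try the Deflate figure on one's own binaries is sketched below in Python. The helper name `deflate_savings` is hypothetical, and the sketch assumes the raw .text bytes have already been dumped to a file, for example with `objcopy -O binary --only-section=.text bin/bash bash.text`. zlib's small header and checksum are counted in the compressed length, so the result slightly understates pure Deflate.

```python
import zlib


def deflate_savings(code: bytes, level: int = 9) -> float:
    """Fraction of bytes saved by zlib (Deflate) compression of `code`."""
    if not code:
        raise ValueError("empty input")
    return 1.0 - len(zlib.compress(code, level)) / len(code)


# Synthetic demo input; for real code, read the dumped section instead:
#   with open("bash.text", "rb") as f: print(deflate_savings(f.read()))
print(f"{deflate_savings(bytes(4096)):.1%} saved on 4096 zero bytes")
```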

as can be noted, this symmetry breaks down some when compressing
natural-language text.

going further in the CISC direction and also balancing instruction
encodings could get better density than x86, with the tradeoff of higher
internal complexity.

a design in this direction could be eliminating most of the conventional
registers in favor of a stack-relative space and a lot of 3-address
forms (basically, the stack pointer also serves as a base for
memory-mapped registers, and push/pop effectively moves the register
space; probably with a small rotating ring of registers for temporaries
or similar).

partial idea:
* 0000-cccd ddss-sttt  Ld = Ls op Lt  //operates on locals
* 0001-cccd ddss-sttt  Ld = Ls op Rt  //Operate on locals and ring
* 0010-ccc0 ssss-tttt  R = Ls op Lt   //Locals w/ result in ring
* 0010-ccc1 ssss-0ttt  R = Ls op Rt   //Locals+ring w/ result in ring
* 0010-ccc1 0sss-1ttt  R = Rs op Rt   //Locals+ring w/ result in ring
* 0011-0iii iiii-iiii  R = I11        //Load 11-bit immed into ring.
* 0011-100i iiii-iiii  R = R0+I9      //Ring + Imm9
* 0011-1010 ssss-ssss  R = Ls         //Load Local (8-bit index)
* 0011-1011 dddd-dddd  Ld = R0        //Store last ring value to local
* 0011-1100 iiii-iiii  SP=SP-Imm8     //adjust stack position (PUSH)
* 0011-1101 iiii-iiii  SP=SP+Imm8     //adjust stack position (POP)
* ...

where ccc=ADD/SUB/MUL/AND/OR/XOR/SHL/SAR

so:
int x, y, z;
z=x+y*1021+1021;
could be expressed as:
33FD 2510 2118 1080
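
The four words can be checked against the partial list with a small decoder. This is only a Python sketch of one reading of the list, under these assumptions: ccc selects ADD..SAR in the listed order (0..7), immediates are unsigned, R0 is the most recent ring value, and x, y, z live in locals L0, L1, L2. The name `decode` is made up for illustration.

```python
OPS = ("ADD", "SUB", "MUL", "AND", "OR", "XOR", "SHL", "SAR")  # assumed ccc order 0..7


def decode(word: int) -> str:
    """Decode one 16-bit word of the partial encoding listed above."""
    top = word >> 12        # leading 4-bit group
    ccc = (word >> 9) & 7   # ALU operation selector
    if top in (0b0000, 0b0001):            # 000x-cccd ddss-sttt
        d, s, t = (word >> 6) & 7, (word >> 3) & 7, word & 7
        bank = "L" if top == 0b0000 else "R"
        return f"L{d} = L{s} {OPS[ccc]} {bank}{t}"
    if top == 0b0010:
        if not word & 0x100:               # 0010-ccc0 ssss-tttt
            return f"R = L{(word >> 4) & 15} {OPS[ccc]} L{word & 15}"
        if not word & 0x008:               # 0010-ccc1 ssss-0ttt
            return f"R = L{(word >> 4) & 15} {OPS[ccc]} R{word & 7}"
        if not word & 0x080:               # 0010-ccc1 0sss-1ttt
            return f"R = R{(word >> 4) & 7} {OPS[ccc]} R{word & 7}"
    if top == 0b0011:
        if not word & 0x800:               # 0011-0iii iiii-iiii
            return f"R = {word & 0x7FF}"
        if ccc == 0b100:                   # 0011-100i iiii-iiii
            return f"R = R0 + {word & 0x1FF}"
        low = word & 0xFF
        forms = {0b1010: f"R = L{low}", 0b1011: f"L{low} = R0",
                 0b1100: f"SP = SP - {low}", 0b1101: f"SP = SP + {low}"}
        if (word >> 8) & 15 in forms:
            return forms[(word >> 8) & 15]
    raise ValueError(f"{word:04X}: not covered by the partial list")


for w in (0x33FD, 0x2510, 0x2118, 0x1080):
    print(f"{w:04X}  {decode(w)}")
# 33FD  R = 1021
# 2510  R = L1 MUL R0
# 2118  R = R1 ADD R0
# 1080  L2 = L0 ADD R0
```

Under those assumptions the sequence loads 1021, multiplies it by y, adds the earlier 1021 back from the ring, and stores x plus that sum into z, which matches the C expression in 8 bytes.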

but, dunno...

BGB

Aug 9, 2017, 3:59:06 PM

this is what I would suspect as well.

64-bit x86 code can generally operate with most of the working variables
being kept cached in registers, whereas 32-bit x86 tends to have much
shorter-lived cached-variables and a much higher density of
register/memory operations.

though, as a side-effect, it is harder to compete with optimized 64-bit
code than with optimized 32-bit code when using a naive code-generator
without a decent register allocator.

curiously, now 64-bit code seems to often have around a 30-40% speed
advantage over 32-bit code, whereas I think I remember early on they
were typically closer to break-even.

MitchAlsup

Aug 9, 2017, 6:07:47 PM

On Wednesday, August 9, 2017 at 10:54:47 AM UTC-5, Quadibloc wrote:
> Ah, while I did notice its relevance to my designs, it now takes away an excuse for variable-length instructions, also needed for your immediates.

Not necessarily.
To get there one would need a count of executed instructions,
not just the footprint in memory.
Variable-length instructions that supply immediates and displacements
execute fewer <longer> instructions.

Bruce Hoult

Aug 9, 2017, 6:40:11 PM

On Wednesday, August 9, 2017 at 4:43:42 PM UTC+3, Anton Ertl wrote:
> bash grep gzip
> 510144 105100 46992 armhf
> 697794 148122 62506 amd64
> 712580 134116 59452 arm64
> 754436 152948 61724 i386
> 787780 157384 63268 armel
> 841024 171824 75536 powerpc
> 854272 177088 72600 s390x
> 899984 mips64el
> 916848 182400 77072 mipsel
> 917024 182432 77104 mips
> 971684 186296 84912 ppc64el

RISC-V isn't in the main Debian repo yet, but I found similar versions at http://riscv.mit.edu/debian/pool/main

bash 4.4-4
grep 2.27-2
gzip 1.6-5

702352 154316 100436 riscv64

bash fits between amd64 and arm64, grep between i386 and armel, and gzip is mysteriously much bigger than any of yours.

These are 64-bit binaries using purely 32-bit instructions. Adding use of the optional 16-bit instructions should reduce the size by 25% - 30%, to a size very similar to armhf for bash and grep, and somewhere around s390x for gzip.
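
The estimate is easy to check with a little arithmetic, sketched here in Python. The 25% to 30% reduction is the figure stated in this post, not a measurement, and the helper name `compressed_range` is made up for illustration; the armhf and s390x sizes are taken from the quoted table.

```python
RISCV64 = {"bash": 702352, "grep": 154316, "gzip": 100436}
# Reference .text sizes in bytes from the quoted (newer) set.
ARMHF = {"bash": 510144, "grep": 105100, "gzip": 46992}
S390X = {"bash": 854272, "grep": 177088, "gzip": 72600}


def compressed_range(size, low=0.25, high=0.30):
    """Return (smallest, largest) size after shrinking by `high` and `low`."""
    return round(size * (1 - high)), round(size * (1 - low))


for prog, size in RISCV64.items():
    smallest, largest = compressed_range(size)
    print(f"{prog}: {smallest}..{largest} "
          f"(armhf {ARMHF[prog]}, s390x {S390X[prog]})")
```

On these numbers the armhf bash size falls inside the projected range, armhf grep sits slightly below it, and the s390x gzip size falls inside the projected gzip range.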

BGB

Aug 9, 2017, 10:06:13 PM

yeah.

I haven't done my own testing yet, but from what I have seen/heard of
RISC-V, it seems pretty promising.

even if, granted, I am off basically investing a bunch of effort into a
project that is at this point seeming a bit like trying to polish a
turd, but oh well...

already...@yahoo.com

Aug 10, 2017, 4:23:17 AM

On Wednesday, August 9, 2017 at 4:43:42 PM UTC+3, Anton Ertl wrote:

Can you compile for 'mips32r6' and 'mips64r6'?
Both targets are supported starting from GCC 5.

Anton Ertl

Aug 10, 2017, 6:02:03 AM

already...@yahoo.com writes:
>Can you compile for 'mips32r6' and 'mips64r6' ?

Not any easier than you.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Walter Banks

Aug 10, 2017, 6:33:04 AM

Code generation tool sets may be another factor as well. Common tool
sets shape ISA's and distort code generation. I have seen this in
embedded systems ISA's but when tool sets are written to generate code
specifically for an ISA the real differences between instruction sets
start to show.

Increasingly code size is determined by how compactly the ISA can describe
the application. VM's are not more compact than many ISA's for the same
problem.

w..
