
A (late) discussion point in big vs little endian


Thomas Koenig

Dec 21, 2021, 8:13:30 AM
The ship has pretty much sailed for big vs. little endian.

The Datapoint 2200 was a bit-serial design, which naturally calls for
little-endian words. Intel implemented that ISA in the 8008, IBM
chose its descendant, the 8088, for the PC, and PCs rolled up the market.

Which is better is a holy war of the past, now. I have only
been able to find one valid point: comparing multiple bytes, for
example in memcmp. For big-endian, you can compare whole words, and
the result is the same as if you had compared the bytes individually.
For little-endian, this is not the case, as the following test
program shows.

#include <stdio.h>
#include <string.h>

char a[4];
char b[4] = {1, 0, 0, 1};

int main()
{
    unsigned int ai, bi;

    for (int i = 0; i < 3; i++) {
        a[0] = i;
        for (int j = 0; j < 3; j++) {
            a[3] = j;
            /* Reinterpret both byte arrays as native-endian words
               (assumes sizeof(int) == 4). */
            memcpy(&ai, a, sizeof(int));
            memcpy(&bi, b, sizeof(int));
            printf("ai < bi = %d,", ai < bi);
            printf(" a < b = %d\n", memcmp(a, b, 4) < 0);
        }
    }

    return 0;
}

For an efficient memcmp on a little-endian machine, byte reversal is
needed (this can also be seen, for example, in the source of glibc).
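For illustration (this sketch is mine, not from the thread), here is what such
a word-wise compare looks like for one 4-byte chunk; __builtin_bswap32 and the
__BYTE_ORDER__ macros are GCC/Clang-specific assumptions:

```c
#include <stdint.h>
#include <string.h>

/* Compare two 4-byte buffers as one word each, matching memcmp's sign.
 * On a little-endian host the loaded words must be byte-reversed first
 * so that the integer compare sees lexicographic (big-endian) order. */
static int cmp4(const void *pa, const void *pb)
{
    uint32_t wa, wb;
    memcpy(&wa, pa, 4);
    memcpy(&wb, pb, 4);
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    wa = __builtin_bswap32(wa);  /* restore lexicographic byte order */
    wb = __builtin_bswap32(wb);
#endif
    return (wa > wb) - (wa < wb);  /* -1, 0, or 1, like memcmp's sign */
}
```

On a big-endian machine the two bswap lines simply disappear, which is the
point being made above.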

It might make sense to put that into the load/store instructions
if the encoding and the timing allow.

EricP

Dec 21, 2021, 11:27:32 AM
Right, to compare two multi-word values you start at the high end,
wherever the high end is located, and move towards the low end.

On the other hand, to add two multi-word values you start at the low end,
wherever the low end is located and move towards the high end.

Next, stacks - grow down or grow up?
SP points to last slot used or next slot empty?
It's a conundrum.


Thomas Koenig

Dec 22, 2021, 3:40:37 AM
EricP <ThatWould...@thevillage.com> schrieb:
There is one bit of difference: how the summation is arranged does
not matter (much) for performance, as long as the whole word
is read in one chunk. For little-endian, you have to execute
some instruction for byte swapping for memcmp().

Quadibloc

Dec 22, 2021, 10:36:03 AM
On Wednesday, December 22, 2021 at 1:40:37 AM UTC-7, Thomas Koenig wrote:

> There is one bit of difference: how the summation is arranged does
> not matter (much) for performance, as long as the whole word
> is read in one chunk. For little-endian, you have to execute
> some instruction for byte swapping for memcmp().

That's a new point, but comparison versus addition was noted at the
very beginning of the little-endian versus big-endian argument.

A while back, I pointed out another argument for big-endian that I
hadn't seen, but it may have been around.

If a computer has a packed decimal data type, then big-endian has
an advantage because:

Packed decimal is meant to facilitate arithmetic on numbers contained
in character strings. The character strings have big-endian order, so
packed decimal should be big-endian to facilitate this.

Because the zone bits are squeezed out, packed decimal resembles binary;
an arithmetic unit could handle both packed decimal and binary if you could
change what was done to generate carries for every four bits. But that only
works conveniently when packed decimal and binary have the same endianness.
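The shared-adder idea can be sketched in software. The following is a toy model
of mine (not from the post): a binary add of two packed-BCD bytes, followed by
the classic decimal-adjust correction of +6 on any four-bit group that
overflowed past 9 — exactly the "change what was done to generate carries for
every four bits" step:

```c
#include <stdint.h>

/* Add two packed-BCD bytes (two decimal digits each) on a binary adder,
 * then correct each nibble that exceeded 9, mimicking a combined
 * binary/decimal ALU. *carry receives the decimal carry out. */
static uint8_t bcd_add(uint8_t a, uint8_t b, int *carry)
{
    unsigned sum = a + b;                      /* plain binary add */
    if ((sum & 0x0F) > 9 || ((a & 0x0F) + (b & 0x0F)) > 0x0F)
        sum += 0x06;                           /* fix the low digit  */
    if ((sum & 0xF0) > 0x90 || sum > 0xFF)
        sum += 0x60;                           /* fix the high digit */
    *carry = sum > 0xFF;
    return (uint8_t)sum;
}
```

For example, 0x45 + 0x38 binary-adds to 0x7D; the low-nibble correction turns
that into 0x83, which is the right decimal answer for 45 + 38.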

Of course, the IBM System/360, which has packed decimal, is big-endian.

This may not be a decisive point to some, because some will question
whether or not there is any need for packed decimal arithmetic in a
computer.

John Savard

MitchAlsup

Dec 22, 2021, 10:58:18 AM
People like little endian for performing (integer) arithmetic on
strings because the carries need to move in LE fashion and the
overall operation is more efficient.

People like big endian for performing (integer) compares on strings
because there is no need for carries to determine the result, and thus
it is more efficient top-to-bottom.

So, it seems to me the difference is the carry (lack or presence), not
the order in memory.

Terje Mathisen

Dec 22, 2021, 11:08:19 AM
Since you only need to swap once, at the very end, after detecting the
first word with a difference, it _really_ doesn't matter.
>>>
>>> It might make sense to put that into the load/store instructions
>>> if the encoding and the timing allow.
>>
>> Right, to compare two multi-word values you start at the high end,
>> wherever the high end is located, and move towards the low end.
>>
>> On the other hand, to add two multi-word values you start at the low end,
>> wherever the low end is located and move towards the high end.
>
> There is one bit of difference: how the summation is arranged does
> not matter (much) for performance, as long as the whole word
> is read in one chunk. For little-endian, you have to execute
> some instruction for byte swapping for memcmp().

Again, see above: You just test for equality, then handle endian issues
at the end.
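That "equality first, swap only at the end" scheme can be sketched like this
(my illustration, not from the post; assumes a length that is a multiple of 4
and the GCC/Clang __builtin_bswap32 and __BYTE_ORDER__ extensions):

```c
#include <stdint.h>
#include <string.h>

/* Word-wise memcmp: the equality test per word is endian-neutral, so
 * the byte swap is performed at most once, on the first differing pair. */
static int memcmp_words(const void *pa, const void *pb, size_t len)
{
    const unsigned char *a = pa, *b = pb;
    for (size_t i = 0; i < len; i += 4) {
        uint32_t wa, wb;
        memcpy(&wa, a + i, 4);
        memcpy(&wb, b + i, 4);
        if (wa != wb) {                    /* no swapping needed to test */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
            wa = __builtin_bswap32(wa);    /* one swap, only on a miss */
            wb = __builtin_bswap32(wb);
#endif
            return wa < wb ? -1 : 1;
        }
    }
    return 0;
}
```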

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

Dec 22, 2021, 11:10:13 AM
Rather, I would question the need to store packed decimal in little-endian
order, if this should become a performance issue.

Thomas Koenig

Dec 22, 2021, 11:49:37 AM
Terje Mathisen <terje.m...@tmsw.no> schrieb:
Fair enough.

There is just one (extremely minor) point left: You might not need
a byte-swap instruction on a big-endian architecture. Network
byte order happens to be big-endian and you can do the memcmp
without it.

(POWER, for example, only gained that instruction for the VSX
registers, which were introduced rather late).

EricP

Dec 22, 2021, 11:58:29 AM
You are doing a multiple-byte SIMD compare in a 32-bit register.
Just have proper SIMD compare instructions that order the
operand fields low to high.
(Besides, on either architecture, if the string buffer length is not a
nice multiple of 4, you are going to spend more instructions fiddling
about with the string tail than the swap would ever cost.)

Summation does matter, as branches and mem[reg+offset] addressing occur
a lot. There is an advantage to little endian on simple 8-bit processors
(e.g. the 8080) with 16-bit immediates such as branch offsets or register
offsets. The low byte of the offset can be fetched and routed directly
to the ALU and added in the same clock, with the result saved in a temp
reg. Then the high byte is fetched directly into the ALU, added with the
temp carry, and saved in a temp reg. Then the temp is moved to the PC for
a branch, or to the Memory Address Register (MAR) for a memory access.
Takes 3 cycles.

With a big-endian offset, the high byte is read first and must be saved
in a temp reg; then the low byte is read and routed directly to the ALU;
then the high byte is added; then the temp result is moved to the PC or MAR.
Takes 4 cycles.
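The little-endian case can be modeled in a few lines of C (my sketch, not from
the post): each statement corresponds to one of the cycles described above.

```c
#include <stdint.h>

/* Toy model of base + 16-bit little-endian immediate on an 8-bit ALU.
 * imm[0] is the low byte (fetched first), imm[1] the high byte. */
static uint16_t addr_le(uint16_t base, const uint8_t imm[2])
{
    /* cycle 1: fetch imm[0], add it to the low byte of base */
    unsigned lo = (base & 0xFF) + imm[0];
    /* cycle 2: fetch imm[1], add it with the carry out of cycle 1 */
    unsigned hi = (base >> 8) + imm[1] + (lo >> 8);
    /* cycle 3: move the result to the PC or MAR */
    return (uint16_t)((hi << 8) | (lo & 0xFF));
}
```

With a big-endian immediate, the byte fetched in cycle 1 cannot be added until
the low byte arrives, forcing the extra hold-in-a-temp cycle.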


MitchAlsup

Dec 22, 2021, 12:17:16 PM
Why not just compare bytes from the end towards the beginning ?

John Levine

Dec 22, 2021, 2:13:13 PM
According to Terje Mathisen <terje.m...@tmsw.no>:
>> This may not be a decisive point to some, because some will question
>> whether or not there is any need for packed decimal arithmetic in a
>> computer.
>
>Rather, I would question need to store packed decimal in little endian
>order, if this should become a performance issue.

The S/360 packed decimal format, which I think everyone else uses, give
or take the sign representation, puts the sign in the low nibble. So on
the big-endian S/360, it has to find and fetch the low byte before it
can do anything else.

Its decimal instructions have an address and a fixed length in the
instruction, so finding the low byte is one add. Meh.

The earlier 1400 was de facto little-endian. It was character-addressable,
with numbers stored low digit first and a special word-mark bit on the high
digit. The 1620's addressing did the same thing, although it was otherwise
fairly different from the 1400.



--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

John Levine

Dec 22, 2021, 3:31:09 PM
According to Thomas Koenig <tko...@netcologne.de>:
>There is just one (extremely minor) point left: You might not need
>a byte-swap instruction on a big-endian architecture. Network
>byte order happens to be big-endian and you can do the memcmp
>without it.

I think you're confusing cause and effect. Network byte order was
defined before the PDP-11 shipped, so at the time big-endian was
all there was.

I don't see reordering the bytes in packet headers as a large part
of the load on any systems I use.

Anywhere it matters like in routers, there's special hardware.

Marcus

Dec 23, 2021, 1:13:26 AM
I'm pretty sure that's how the 6502 did it. I didn't understand why it
used little endian at the time, but it makes perfect sense from the point
of view of an 8-bit ALU and data bus doing 16-bit address calculations.

Anton Ertl

Dec 23, 2021, 4:27:16 AM
Thomas Koenig <tko...@netcologne.de> writes:
>Terje Mathisen <terje.m...@tmsw.no> schrieb:
>> Since you only need to swap once, at the very end, after detecting the
>> first word with a difference, it _really_ doesn't matter.
...
>(POWER, for example, only gained that instruction for the VSX
>registers, which were introduced rather late).

PowerPC has had lwbrx from the start, and it is implemented in the
PowerPC 601. AFAIK the 601 core is the same as the POWER RSC, so I
expect that POWER also had lwbrx at that time. If you mean a
reg->reg byte-swap instruction, I guess the need is reduced if you
have byte-swapping load and store instructions like lwbrx.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>