Bit Manipulation and Big Endian support

Allen Baum

Jun 8, 2018, 5:02:38 PM
to RISC-V ISA Dev
I've heard more than once that in Japan (and perhaps China), the overwhelming majority of microcontrollers and embedded processors are Big-Endian. This is, I believe, inhibiting the support of RISC-V in those geographies (not eliminating it, but certainly slowing it down).

The only support that I've heard of for big-endian is from the currently defunct BitManipulation WG, and even there the support was for swapping bytes after they've been loaded and before they are stored.

a. Is that adequate?
b. If not, do we expect anyone who wants native BigEndian support to develop their own custom extension?
c. If not, have there been any discussions of a standard BigEndian extension?

Tommy Thorn

Jun 8, 2018, 5:05:54 PM
to Allen Baum, RISC-V ISA Dev
Hi Allen,

One data point: RISC-V was originally bi-endian, but overwhelmingly the western world has settled on little and it greatly simplified the standard to drop it. I don't think adding native BigEndian makes sense, but adding support for various swaps does make a ton of sense.

Tommy


Samuel Falvo II

Jun 8, 2018, 5:43:54 PM
to Tommy Thorn, Allen Baum, RISC-V ISA Dev
On Fri, Jun 8, 2018 at 2:05 PM, Tommy Thorn
<tommy...@esperantotech.com> wrote:
> Hi Allen,
>
> One data point: RISC-V was originally bi-endian, but overwhelmingly the
> western world have settled on little and it greatly simplified the standard
> to drop it. I don't think adding native BigEndian makes sense but adding
> support for various swaps does make a ton of sense.

Just my opinion on this matter.

While adding swaps is useful in some cases, more useful is having
native big- and little-endian memory accessors. If you look at the
overwhelming majority of use-cases for byte-swap operations, it's
always to "fix" the endianness of a fetched field before subsequent
processing, and again prior to storing results back into the field
(e.g., BSD sockets' hton-family of macros). Eliminating those swaps
seems more useful. Being able to declare inside structures which
fields are explicitly big- or little-endian also vastly improves the
readability of program source listings.
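
For concreteness, the pattern I have in mind looks roughly like this (an illustrative C sketch only; the struct and function names are invented):

#include <stdint.h>
#include <arpa/inet.h>   /* ntohl()/htonl(), i.e. the hton family */

struct seq_field { uint32_t seq; };   /* field kept in network (big-endian) order */

static uint32_t bump_seq(struct seq_field *f, uint32_t len)
{
    uint32_t seq = ntohl(f->seq);   /* swap after the load */
    seq += len;                     /* process in host order */
    f->seq = htonl(seq);            /* swap again before the store */
    return seq;
}

With native big-endian accessors, both swaps fold into the load and the store themselves.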

--
Samuel A. Falvo II

Allen Baum

Jun 8, 2018, 5:50:21 PM
to Samuel Falvo II, Tommy Thorn, RISC-V ISA Dev
So 4 possible levels of support:
 - none
 - swap instructions
 - BigEndian load/store mode
 - BigEndian load/store instructions

Luke Kenneth Casson Leighton

Jun 8, 2018, 8:17:32 PM
to Allen Baum, RISC-V ISA Dev
On Fri, Jun 8, 2018 at 10:02 PM, Allen Baum
<allen...@esperantotech.com> wrote:

> i've heard more than once that in Japan (and perhaps China), the
> overwhelming majority of microcontrollers and embedded processors are
> Big-Endian.

in japan: PowerPC, yes. at barcelona i met someone from japan who
turned out to be the unofficial host of the powerpc-be debian port.
he was specifically there to gauge the practicality of making RISC-V
bi-endian. (also, just worth observing: Andes V3 is bi-endian)

note that that's not *if* to make RISC-V bi-endian, but *how* to make
RISC-V bi-endian.

i did not take notes unfortunately, so i do not know his name. he
mentioned that he was returning to japan with a report, with a view to
applying for government funding to get this done. i introduced him to
manuel (mafm on OFTC #debian-riscv) so he could get a rough idea of
how much work would be involved in debootstrapping a riscv-be debian
port, and i noticed he was talking to yunsup as well; i did not take
part in that conversation.

basically the powerpc-be community in japan is so enormous and the
software base so large that they cannot just "drop everything" and
convert to little-endian architectures: they *need* bi-endian-ness (in
some fashion).

l.

Shumpei Kawasaki

Jun 8, 2018, 8:23:04 PM
to Luke Kenneth Casson Leighton, Allen Baum, RISC-V ISA Dev

I made that comment in the marketing members' meeting. 

99 percent of PC, mobile and data center applications are little-endian, but it is also true that 90 percent of industrial and infrastructure applications are big-endian. ARM users actively use its big-endian mode and will continue to do so.

The bi-endian feature can improve performance or simplify the logic of networking devices and software. Many architectures (ARM, PowerPC, Alpha, SPARC V9, MIPS, PA-RISC, SuperH SH-4 and IA-64) feature a setting which allows for switchable endianness in data segments, code segments or both (Source: https://en.wikipedia.org/wiki/Endianness).  

The GNU Compiler Collection, binutils, Linux, UEFI and other cross tools and OSes support bi-endianness in a clean manner. The work needed is mostly in the cross tools, along with some in hardware. We can start some ground work to provide a bi-endian platform for RISC-V; RISC-V GCC shows no prior bi-endian work, so developers will need to work with the community. We know that this reduces the porting work involved in converting applications to RISC-V.

This feature on RISC-V will create an easier transition path from PowerPC, SH, 68K, and Coldfire.


Jacob Bachmeyer

Jun 8, 2018, 9:39:55 PM
to Shumpei Kawasaki, Luke Kenneth Casson Leighton, Allen Baum, RISC-V ISA Dev
Shumpei Kawasaki wrote:
> I made that comment in the marketing members' meeting.
>
> 99 percent of PC, mobile and data center applications are
> little-endian but it is also true that 90 percent industrial and
> infrastructure applications are big-endian. ARM users actively use its
> big-endian mode and will continue to do so.
>
> The bi-endian feature can improve performance or simplify the logic of
> networking devices and software. Many architectures (ARM, PowerPC,
> Alpha, SPARC V9, MIPS, PA-RISC, SuperH SH-4 and IA-64) feature a
> setting which allows for switchable endianness in data segments, code
> segments or both (Source: https://en.wikipedia.org/wiki/Endianness).
>
> GNU Compiler Collections, binutils, Linux, UEFI and other cross tools
> and OSes support bi-endian in clean manners. It is more cross tool
> work that is needed and work in hardware. We can start some ground
> work to provide a bi-endian platform for RISC-V and RISC-V GCC shows
> no prior bi-endian work so developers will need to work with
> community. We know that this reduces porting work involved in
> convert applications to RISC-V.

Would big-endian data memory access opcodes be a good solution? There
would be the small complexity that RISC-V program text would always be
little endian, but that is needed due to the instruction length encoding.

There is no room in the 32-bit opcode space to put big-endian
LOAD/STORE, but an extension could easily add these as 48-bit or 64-bit
opcodes. The big advantage I see from such a bi-endian extension is
that it would make RISC-V truly bi-endian with native memory access in
either order as needed. (For the extreme embedded case, new standard
long-form big-endian memory access opcodes could be "aliased" into
CUSTOM-0/CUSTOM-1 to fit them on a 32-bit-instruction-only machine.)


-- Jacob

Luke Kenneth Casson Leighton

Jun 8, 2018, 10:01:38 PM
to Jacob Bachmeyer, Shumpei Kawasaki, Allen Baum, RISC-V ISA Dev
On Sat, Jun 9, 2018 at 2:39 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> There is no room in the 32-bit opcode space to put big-endian LOAD/STORE,
> but an extension could easily add these as 48-bit or 64-bit opcodes.

the consequences of that are that it would make big-endian a "second
rate citizen"... although at this point it's almost too late.
https://en.wikipedia.org/wiki/Endianness#Bi-endianness seems to me to
imply that the equivalent of CSRs may have been used historically to
set endian-ness.

l.

Jacob Bachmeyer

Jun 8, 2018, 10:13:15 PM
to Luke Kenneth Casson Leighton, Shumpei Kawasaki, Allen Baum, RISC-V ISA Dev
I do not see a serious problem here, since that proverbial ship has
arguably already sailed: RISC-V program text is little-endian, and
changing *that* would make a huge mess. Further, the use of additional
big-endian memory access opcodes would make RISC-V truly bi-endian, with
the big-endian/little-endian distinction being made at runtime and
encoded into the program text, rather than being an implicit parameter.
I argue that this is a better fit, since it would allow/require the
expected byte order for data to be explicitly stated in the program.

Lastly, (and this ties back to the extensible assembler database I
proposed earlier) standardizing big-endian memory access as 48-bit or
64-bit opcodes does not preclude implementations from "aliasing" those
long-form standard opcodes into the 32-bit opcode space as non-standard
encodings of standard instructions.


-- Jacob

Bruce Hoult

Jun 8, 2018, 10:27:10 PM
to Jacob Bachmeyer, Shumpei Kawasaki, Luke Kenneth Casson Leighton, Allen Baum, RISC-V ISA Dev
I imagine it might be possible to find room for simple sized big endian load and store without any offset (or indexing). It would need 14 bits of opcode space: 1 for load/store, 3 for size/type, 5 for pointer register, 5 for src/dest register.  Or 2x 13 bits, obviously.

The performance-critical uses (where a swap instruction MIGHT not be enough) are likely to be stepping through data in a loop, and not need an offset.

Jim Wilson

Jun 8, 2018, 10:27:17 PM
to Jacob Bachmeyer, Shumpei Kawasaki, Luke Kenneth Casson Leighton, Allen Baum, RISC-V ISA Dev
On Fri, Jun 8, 2018 at 6:39 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Would big-endian data memory access opcodes be a good solution? There would
> be the small complexity that RISC-V program text would always be little
> endian, but that is needed due to the instruction length encoding.

On ARMv7 and later, the code is always little endian, even when the
processor is in big-endian mode. It is only the data accesses that
change, as it is only the data accesses that matter to end users. I
think ARMv6 has support for both big and little endian code, depending
on a mode bit, but the big-endian code stuff was only for backwards
compatibility with older ARM processors, and was dropped in ARMv7.

Jim

Shumpei Kawasaki

Jun 8, 2018, 10:27:54 PM
to jcb6...@gmail.com, Luke Kenneth Casson Leighton, Allen Baum, RISC-V ISA Dev, Oleg Endo, Akira Tsukamoto

Compared to languages like Pascal, C lets you access data in more than one way, e.g. structures, unions, etc. Handling these constructs involves bit arrangements on top of byte arrangements. Microsoft and Hitachi ported the .NET Micro Framework, originally little-endian, to big-endian SHs. It took the engineers four times longer than we initially anticipated, and it took a very long time to shake out the issues. The programmers involved were all systems programmers.

SH has a swap instruction. ARM and PowerPC have endian-swap instructions for handling endianness, and all also offer bi-endian options. Making use of endian-swap instructions from high-level language programming is not that straightforward. Linux network driver code is layered in such a way that high-level functions abstract out endianness and the endian-aware code sits at the bottom layer of functions.

-Shumpei


Andrew Waterman

Jun 8, 2018, 11:31:06 PM
to jcb6...@gmail.com, Allen Baum, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Shumpei Kawasaki
IMO, bi-endianness isn’t enough of a goal to give the big-endian loads and stores 12-bit offsets. If they are just register-indirect, they can be encoded more cheaply in the 32-bit space.

(FWIW, I still favor byte-swap instructions for this purpose. That’s what we/Tommy proposed in the original B extension proposal years ago.)





Samuel Falvo II

Jun 8, 2018, 11:35:50 PM
to Andrew Waterman, Jacob Bachmeyer, Allen Baum, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Shumpei Kawasaki
On Fri, Jun 8, 2018 at 8:30 PM, Andrew Waterman <and...@sifive.com> wrote:
> (FWIW, I still favor byte-swap instructions for this purpose. That’s what
> we/Tommy proposed in the original B extension proposal years ago.)

I won't contest this. My comment did clearly state that it was an
opinion, and thus, not really backed by any kind of science. I
suppose it's possible to fuse load-then-swap and swap-then-store
sequences to get comparable performance benefits; the disadvantage, of
course, would be greater space consumption.

Richard Herveille

Jun 9, 2018, 12:11:59 AM
to Allen Baum, Samuel Falvo II, Tommy Thorn, RISC-V ISA Dev, Richard Herveille

 

On 08/06/2018, 23:50, "Allen Baum" <allen...@esperantotech.com> wrote:

So 4 possible levels of support:
 - none
 - swap instructions
 - BigEndian load/store mode
 - BigEndian load/store instructions

I doubt little vs big-endian is an issue for adoption. We see a lot of requests from China, admittedly less from Japan. But then Japan is considered conservative.

Instead of declaring new opcodes for big-endian access, wouldn’t declaring the memory space/region big-endian be sufficient? That could be encoded in the MMU record, the PMA and/or the PMP records.

This requires no changes to the CPU pipeline. The CPU then just always works in little-endian mode. There’s no need for new opcodes or additional byte-swap instructions. The data is just loaded/stored in little/big endian format.

 

Cheers,

Richard

 


Albert Cahalan

Jun 9, 2018, 12:27:43 AM
to Tommy Thorn, Allen Baum, RISC-V ISA Dev
On 6/8/18, Tommy Thorn <tommy...@esperantotech.com> wrote:

> One data point: RISC-V was originally bi-endian, but overwhelmingly the
> western world have settled on little and it greatly simplified the standard
> to drop it. I don't think adding native BigEndian makes sense but adding
> support for various swaps does make a ton of sense.

I have three suggestions for what "various swaps" might be.

The first is that an immediate value determines the swapping.
Viewing the register as an array of bits, the source index for
each destination bit is determined by XORing the immediate
value with the destination index of that bit. Thus a value of 0x00
does nothing, a value of 0x01 swaps adjacent bits, a value of
0x04 swaps adjacent nibbles, a value of 0x08 swaps adjacent
bytes (a vector htons), a value of 0x18 does 1 to 4 of htonl,
a value of 0x07 does bit reversal within bytes, etc.

The second is that sign extension might commonly follow a
swapping operation. The above would need an extra 3 bits
to specify the size, for a total of 10 on 128-bit RISC-V. Alone,
it only takes 2 bits. It is less important to have unsigned versions
because sign bits are conveniently cleared by smaller-sized
stores to memory, though that would just take another bit.
Since the RISC-V immediates tend to be 11-bit, it is available.

The third is that shuffle instructions can handle byte swapping.
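
For clarity, a C reference model of the first suggestion (illustrative only, not a proposed encoding; the function name is made up). Destination bit i of the result takes source bit (i XOR imm):

#include <stdint.h>

static uint64_t xor_permute(uint64_t x, unsigned imm)
{
    /* imm=0x00: identity; 0x01: swap adjacent bits; 0x04: swap nibbles
       within bytes; 0x08: swap adjacent bytes (vector htons); 0x18:
       reverse the bytes of each 32-bit word (htonl); 0x07: reverse the
       bits within each byte. */
    uint64_t r = 0;
    for (unsigned i = 0; i < 64; i++)
        r |= ((x >> ((i ^ imm) & 63)) & 1ULL) << i;
    return r;
}

This is essentially the generalized-reverse (GREV/GREVI) semantics that has been discussed for the draft B extension.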

Jacob Bachmeyer

Jun 9, 2018, 12:44:47 AM
to Shumpei Kawasaki, Luke Kenneth Casson Leighton, Allen Baum, RISC-V ISA Dev, Oleg Endo, Akira Tsukamoto
Shumpei Kawasaki wrote:
> Compared to the languages like Pascal, C would let you access a data
> in more than one way e.g. structure, union, etc.. Handling these
> constracts involve bit arrangements on the top of byte arrangements.
> Microsoft and Hitachi ported .NET Micro-framework, originally
> little-endian, to big-endian SHs. It took engineers four times longer
> from what we initially anticipated taking very long time to shake out
> issues. The programmers involved were all systems programmers.
>
> SH has swap instruction. ARM and PowerPC have endian swap instructions
> for handling endian, and all also offer bi-endian options. Enabling
> endian swap instructions in high-level language programming is not
> that straightforward. Linux network driver code is layered in such a
> way high-level functions abstract out endian and then endian-aware
> code at the bottom layer of functions.

I appear to have been misunderstood. I am proposing additional
big-endian LOAD/STORE opcodes. In high-level code (C is high-level
enough) endianness would be indicated per-datum, possibly using
attributes and defaulting to little-endian if unspecified. The compiler
then uses the big-endian memory access instructions when accessing data
that is big-endian according to its type.

For example, a TCP header could be a simple struct with
attribute((big_endian)) applied. GCC would then know to access
multi-byte fields in that struct using big-endian opcodes.
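
As a rough sketch of what that could look like (GCC already has a related type attribute, scalar_storage_order, which today makes the compiler emit byte swaps; the same source-level information could instead select big-endian opcodes; the field layout here is purely illustrative):

#include <stdint.h>

struct __attribute__((scalar_storage_order("big-endian"))) tcp_header {
    uint16_t src_port;   /* all scalar fields held in big-endian order */
    uint16_t dst_port;
    uint32_t seq;
    uint32_t ack;
    /* ... */
};

uint32_t read_seq(const struct tcp_header *h)
{
    return h->seq;       /* the compiler inserts the reordering on access */
}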


-- Jacob

Jacob Bachmeyer

Jun 9, 2018, 12:51:49 AM
to Andrew Waterman, Allen Baum, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Shumpei Kawasaki
Andrew Waterman wrote:
> [...]
> IMO, bi-endianness isn’t enough of a goal to give the big-endian loads
> and stores 12-bit offsets. If they are just register-indirect, they
> can be encoded more cheaply in the 32-bit space.
>
> (FWIW, I still favor byte-swap instructions for this purpose. That’s
> what we/Tommy proposed in the original B extension proposal years ago.)

I only offered the suggestion because it appears that there is interest
from parties who seem to find byte-swap insufficient and I was trying to
keep them "at parity" with the baseline LOAD/STORE as much as possible.

This suggests that big-endian LOAD/STORE could be assembler
pseudo-instructions combining a byte-swap and an ordinary LOAD/STORE.
Those assembler pseudo-instructions could be overridden using the
extensible assembler database for hardware that actually does define
(non-standard) encodings for big-endian LOAD/STORE. As Sam Falvo seems
to have suggested in his reply, the "standard 64-bit encoding" for
big-endian LOAD/STORE could be a pair of 32-bit instructions.


-- Jacob

ron minnich

Jun 9, 2018, 1:09:22 AM
to jcb6...@gmail.com, Shumpei Kawasaki, Luke Kenneth Casson Leighton, Allen Baum, RISC-V ISA Dev, Oleg Endo, Akira Tsukamoto
On Fri, Jun 8, 2018 at 9:44 PM Jacob Bachmeyer <jcb6...@gmail.com> wrote:


For example, a TCP header could be a simple struct with
attribute((big_endian)) applied.  GCC would then know to access
multi-byte fields in that struct using big-endian opcodes.


as regards something like this, I've never seen a convincing argument re performance that we need to tag data with endian attributes. 

And I'm mentioned as one of the guys who pushed such a bad idea in the now-withdrawn https://standards.ieee.org/findstds/standard/1596.5-1993.html, so in my dark past, I even believed in this kind of thing. Oops.

We did a test a few years back and as of gcc 6, it's pretty smart about turning certain sequences of byte access into single word load/store.  

As regards most code that thinks it needs to be endian-aware, this particular note is useful:

I've found that Rob's note is correct far more often than not. 

Andrew Waterman

Jun 9, 2018, 4:46:50 AM
to Samuel Falvo II, Allen Baum, Jacob Bachmeyer, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Shumpei Kawasaki
Yeah, I agree with the intuition behind your previous email. What underpins my preference for the byte-swap instruction approach is that it gets the lion’s share of the benefit, and it’s an easier ask of both HW implementors and software-stack maintainers.

The fusion argument is relevant, since big-endian memory ops will either be 48-bit instructions or 32-bit instructions with limited addressing modes. The addressing mode might make it become a 48- or 64-bit sequence, anyway. So fusing an RVI or RVC memory access with a 32-bit byte-swap instruction could be similarly efficient in many cases.

Luke Kenneth Casson Leighton

Jun 9, 2018, 5:06:12 AM
to Andrew Waterman, Samuel Falvo II, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev, Shumpei Kawasaki
it does mean that big-endian would come with an instruction-cache and
power-usage hit. would anyone have an idea of what kind of ratios
such big-endian load/stores would be in terms of total numbers of
instructions executed?

l.

Andrew Waterman

Jun 9, 2018, 5:31:11 AM
to Luke Kenneth Casson Leighton, Samuel Falvo II, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev, Shumpei Kawasaki
My point is that there will be such a hit, no matter which approach is
taken. There's effectively no room in RVC to encode new big-endian
loads and stores. There's effectively no room in RVI to encode new
big-endian loads and stores with 12-bit offsets. So, you're left
with either wider instructions or two-instruction sequences. Both are
defensible, though the latter is less onerous.

>
> l.

Luke Kenneth Casson Leighton

Jun 9, 2018, 5:45:43 AM
to Andrew Waterman, Samuel Falvo II, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev, Shumpei Kawasaki
On Sat, Jun 9, 2018 at 10:30 AM, Andrew Waterman <and...@sifive.com> wrote:

> My point is that there will be such a hit, no matter which approach is
> taken. There's effectively no room in RVC to encode new big-endian
> loads and stores. There's effectively no room in RVI to encode new
> big-endian loads and stores with 12-bit offsets. So, you're left
> either wider instructions or two-instruction sequences.

there is another option: the conflict-resolution scheme. it was
discussed a couple months back, and is "effectively" as if 32-bit (or
other sized) opcodes had been extended (by some hidden bits that are
set with a CSR).

using that scheme the actual meaning of existing opcodes may be
"redirected" to a completely different execution engine, *without*
impact on the pipeline speed or introducing extra latency [1], and,
crucially, allowing the processor to be switched back to "standard"
meanings very very quickly.

conceptually it's exactly like c++ namespaces "using ABC".

... now that i think about it, any existing processor that switches
implicitly between big-endian and little-endian execution meanings of
its instructions probably has something near-identical to this going
on under the hood.

would there be anything in RISC-V that prevented or prohibited the
creation of a "using bigendian" namespace, such that the select few
instructions which needed different behaviour would be redirected to
alternative execution engines?

l.

[1] several people raised the concern during the discussion that extra
latency would be introduced into the decode phase: (a) this isn't true
as the decode muxer just has a couple of extra hidden bits into the
selection AND gate (b) MISA *already* enables/disables instructions so
the concept of switching instructions on / off is required and
well-understood, and there have been no complaints from implementors
about MISA introducing pipeline latency.

Luke Kenneth Casson Leighton

Jun 9, 2018, 5:53:17 AM
to Andrew Waterman, Samuel Falvo II, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev, Shumpei Kawasaki
On Sat, Jun 9, 2018 at 10:45 AM, Luke Kenneth Casson Leighton
<lk...@lkcl.net> wrote:
> On Sat, Jun 9, 2018 at 10:30 AM, Andrew Waterman <and...@sifive.com> wrote:
>
>> My point is that there will be such a hit, no matter which approach is
>> taken. There's effectively no room in RVC to encode new big-endian
>> loads and stores. There's effectively no room in RVI to encode new
>> big-endian loads and stores with 12-bit offsets. So, you're left
>> either wider instructions or two-instruction sequences.
>
> there is another option: the conflict-resolution scheme. it was
> discussed a couple months back, and is "effectively" as if 32-bit (or
> other sized) opcodes had been extended (by some hidden bits that are
> set with a CSR).

p.s. jacob already came up with a corresponding / matching scheme for
compilers / binutils, which takes the hidden prefix into account and
walks it through from gcc to binutils to actual assembler.

Andrew Waterman

Jun 9, 2018, 5:54:50 AM
to Luke Kenneth Casson Leighton, Samuel Falvo II, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev, Shumpei Kawasaki
On Sat, Jun 9, 2018 at 2:45 AM, Luke Kenneth Casson Leighton
<lk...@lkcl.net> wrote:
> On Sat, Jun 9, 2018 at 10:30 AM, Andrew Waterman <and...@sifive.com> wrote:
>
>> My point is that there will be such a hit, no matter which approach is
>> taken. There's effectively no room in RVC to encode new big-endian
>> loads and stores. There's effectively no room in RVI to encode new
>> big-endian loads and stores with 12-bit offsets. So, you're left
>> either wider instructions or two-instruction sequences.
>
> there is another option: the conflict-resolution scheme. it was
> discussed a couple months back, and is "effectively" as if 32-bit (or
> other sized) opcodes had been extended (by some hidden bits that are
> set with a CSR).
>
> using that scheme the actual meaning of existing opcodes may be
> "redirected" to a completely different execution engine, *without*
> impact on the pipeline speed or introducing extra latency [1], and,
> crucially, allowing the processor to be switched back to "standard"
> meanings very very quickly.

I agree that extending the opcode by a few bits will not materially
exacerbate decode latency.

But this issue isn't anywhere near important enough to merit such an
elaborate strategy. Either Sam's or my/Tommy's solution is
sufficient.

>
> conceptually it's exactly like c++ namespaces "using ABC".
>
> ... now that i think about it, any existing processor that switches
> implicitly between big-endian and litte-endian execution meanings of
> its instructions probably has something near-identical to this going
> on under the hood.
>
> would there be anything in RISC-V that prevented or prohibited the
> creation of a "using bigendian" namespace, such that the select few
> instructions which needed different behaviour would be redirected to
> alternative execution engines?
>
> l.
>
> [1] several people raised the concern during the discussion that extra
> latency would be introduced into the decode phase: (a) this isn't true
> as the decode muxer just has a couple of extra hidden bits into the
> selection AND gate (b) MISA *already* enables/disables instructions so
> the concept of switching instructions on / off is required and
> well-understood, and there have been no complaints from implementors
> about MISA introducing pipeline latency.
>

Luke Kenneth Casson Leighton

Jun 9, 2018, 6:23:08 AM
to Andrew Waterman, Samuel Falvo II, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev, Shumpei Kawasaki
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68


On Sat, Jun 9, 2018 at 10:54 AM, Andrew Waterman <and...@sifive.com> wrote:
> On Sat, Jun 9, 2018 at 2:45 AM, Luke Kenneth Casson Leighton
> <lk...@lkcl.net> wrote:
>> On Sat, Jun 9, 2018 at 10:30 AM, Andrew Waterman <and...@sifive.com> wrote:
>>
>>> My point is that there will be such a hit, no matter which approach is
>>> taken. There's effectively no room in RVC to encode new big-endian
>>> loads and stores. There's effectively no room in RVI to encode new
>>> big-endian loads and stores with 12-bit offsets. So, you're left
>>> either wider instructions or two-instruction sequences.
>>
>> there is another option: the conflict-resolution scheme. it was
>> discussed a couple months back, and is "effectively" as if 32-bit (or
>> other sized) opcodes had been extended (by some hidden bits that are
>> set with a CSR).
>>
>> using that scheme the actual meaning of existing opcodes may be
>> "redirected" to a completely different execution engine, *without*
>> impact on the pipeline speed or introducing extra latency [1], and,
>> crucially, allowing the processor to be switched back to "standard"
>> meanings very very quickly.
>
> I agree that extending the opcode by a few bits will not materially
> exacerbate decode latency.
>
> But this issue isn't anywhere near important enough to merit such an
> elaborate strategy.

if it's considered elaborate then it's been completely misunderstood:
the scheme is simply a generalisation of a well-used (but probably not
that well-documented) technique. i would go so far as to speculate
that it is so *un*elaborate, being quite literally no more than putting a
couple extra bits into the AND gate of a given instruction at decode
phase, that teams using the technique to create dynamic bi-endian
processors didn't see fit to give it a name! :)

> Either Sam's or my/Tommy's solution is sufficient.

... with performance / power penalties that may or may not be
acceptable to an implementor.

luckily the conflict-resolution scheme fits within the RISC-V rules
(which say that even standard opcodes may be given different meanings)
so there is no conflict even with the RISC-V ISA Manual, even to the
point where a processor may apply for (and receive) a Conformance
Certificate. i.e. it doesn't need the RISC-V Foundation's approval to
implement.

l.

Andrew Waterman

Jun 9, 2018, 6:56:53 AM
to Luke Kenneth Casson Leighton, Samuel Falvo II, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev, Shumpei Kawasaki
On Sat, Jun 9, 2018 at 3:22 AM, Luke Kenneth Casson Leighton wrote:
The hardware's trivial. It's elaborate because dynamically
repurposing opcodes significantly complicates the software story.

>
>> Either Sam's or my/Tommy's solution is sufficient.
>
> ... with performance / power penalties that may or may not be
> acceptable to an implementor.
>
> luckily the conflict-resolution scheme fits within the RISC-V rules
> (which say that even standard opcodes may be given different meanings)
> so there is no conflict even with the RISC-V ISA Manual, even to the
> point where a processor may apply for (and receive) a Conformance
> Certificate. i.e. it doesn't need the RISC-V Foundation's approval to
> implement.
>
> l.
>

Gavin Stark

Jun 9, 2018, 7:39:18 AM
to RISC-V ISA Dev, and...@sifive.com, sam....@gmail.com, allen...@esperantotech.com, jcb6...@gmail.com, shumpei....@swhwc.com
It is with great hesitation that I join the fray here... I have many scars...

Firstly, endianness is really only about how one interprets units of size X as another entity of size Y. This is only a hardware issue if there is hardware that does this interpreting.

From what I can tell the instruction stream is already defined for any endianness; this is because the instruction stream regards the memory as a stream of 16-bit entities, and a RISC-V instruction is constructed from 'byte' address Z as its bottom 16 bits and (if required) from 'byte' address Z+2 as its next 16 bits, and so on. This is in section 1.2. How the 16 bits are stored in the two bytes at Z is 'according to the implementation’s natural endianness'.
Effectively this means that instructions are units of size 16 bits and may be interpreted as another entity of size 16 bits, or 32 bits, or 48 bits; this is a hardware issue, and the specification is clear on how it should be done.

As a side note, this is an issue for JIT compilers and the toolchain; as noted above, the ARM implementation is similar in that instructions are stored in 'little endian'. 

For the data side I don't believe there is an endianness issue in the ISA, with one caveat: section 2.6 states ' RV32I provides a 32-bit user address space that is byte-addressed and little-endian.' Without this statement RISC-V would be biendian; it is an unnecessary restriction.

Again, having said that, there is in every hardware implementation of the CPU somewhere where memories are accessed and data presented over a bus. If this bus is 32 bits wide, with a 32-bit word address, and a byte write is being performed then a particular subsection of the bus is expected to be written, probably dependent on a byte-enable signal. To generate that byte-enable signal requires a choice of which byte address corresponds to which byte lane, and that therefore means the hardware is interpreting units of size 8 as an entity of size 32.

What is the approach if the bus is 64 bits wide? What if it is 128 bits wide? Or, for the embedded space, just 16 bits wide?

So the statement in section 2.6 is relevant - but it is effectively a statement about the platform, not the ISA.

I've stated the above as background. I've been building embedded CPUs for over 20 years now, and have had the big-endian/little-endian question many times over, from both a platform and CPU perspective.
One of the current implementations I am responsible for is what used to be the Intel micro engine, which is a 32-bit network processor core. The question often comes up as 'is it big-endian or little-endian'. Well, it isn't either. It is a 32-bit word processor. It does not interpret 32-bit words as bytes. All memory transactions are in the form of 32-bit quantities.
Except... The main memory subsystem of the surrounding platform is 64-bit, and it can *sometimes* be accessed using a 'byte' address. In this case *much* of the operation is done LWBE - little-word-big-byte-endian. The byte endianness *only* matters if transactions (which are in terms of 32-bit quantities) are *not* aligned to a 32-bit word boundary; in fact, most memory transactions in our implementation that support such unaligned transactions support both little- and big-endian understanding of the bottom 2 bits of address and of the data buses - but this is not a processor issue, this is a memory module issue. Yet since the memory is 32-bit word-addressed there has to be an 'endianness of the databus', in terms of which 32 bits correspond to odd 32-bit memory addresses and which to even 32-bit memory addresses (hence the 'little-word' endianness).

Now, Allen's initial questions were, therefore, *really good*, as they were not related to doing much in the processor (the ISA should be agnostic...). He asked:

>The only support that I've heard of for big-endian is the currently defunct BitManipulation WG, and even there the support was for swapping bytes after they've been loaded and before they are store.
>
>a. Is that adequate?
>b. If not, do we expect anyone who wants native BigEndian support to develop their own custom extension?
>c. if not - have there been any discussions for a standard BigEndian discussion?

and his follow-on

>So 4 possible levels of support:
> - none
> - swap instructions
> - BigEndian load/store mode
> - BigEndian load/store instructions

And so my answers (or input to the discussion):

* I think that the platform definition should be explicitly for a fully little-endian platform; there might be an additional option for a fully big-endian platform, but the endianness of 128-bit and 64-bit memory subsystems may need to be explicit.

* To provide extra support for interpretation of data as a different endianness a byte swap instruction is handy.

* To add a big-endian mode one still has to be explicit about what it means. I have also seen three solutions in the past.

1. A pin on the processor; this is just an input to the (data) memory access subsystem to tell it how to interpret addresses, and is inflexible, but could be standardised quite easily (from a hardware perspective).

2. A register in the processor (usually a CSR); this is a 'dynamic' input to the (data) memory access subsystem to tell it how to interpret addresses. This impacts the ISA in RISC-V terminology (since it defines the CSRs) but perhaps just for particular platforms. This seems doable.

3. An MMU bit in the page tables that identifies an endianness; this seems to me to be more complex than is required, since the target would (at best) be to support embedded processor designs which would be of a single endianness throughout.

* To add big-endian load/store instructions one has to be explicit about what this means - is it fully big-endian (128-bit big-endian, 64-bit big-endian, 32-bit big-endian, 16-bit big-endian (!), etc.). This scares me in the sense that it (as Andrew has said) is a lot of instructions to add. It has been suggested that an 'extended instruction encoding CSR' could be used; this would require saving across interrupts and system calls, and would be a source of considerable bugginess in software (since most of the time it has no effect). And this whole path (of instructions knowing endianness) makes it sound like the *processor* has an endianness, when it really doesn't - it is the memory subsystem and the software.

FWIW the networking space was, going back 10-15 years, wholly big-endian. Cisco's IOS was big-endian only, which meant that there was no way they could utilise any x86 technology. There was an IOS port to big-endian ARM (i.e. a port to the ISA, not a port of endianness), but only for specific ARM designs as most of them did not support big-endian at the time. Cisco eventually moved to be biendian, and they reaped the benefits. x86 never moved... :-)
Nowadays the networking space is (I would estimate) 90% little-endian linux.
I'm not suggesting the embedded space should all jump over, but I would say that making big-endian much of a processor issue would be unnecessary; keep it as a memory subsystem issue, and possibly define mechanisms in a platform specification. And if you like, add a swap instruction coz it's nice to have.

--Gavin

Guy Lemieux

Jun 9, 2018, 10:58:03 AM
to ron minnich, Akira Tsukamoto, Allen Baum, Luke Kenneth Casson Leighton, Oleg Endo, RISC-V ISA Dev, Shumpei Kawasaki, jcb6...@gmail.com
excellent post!

this needs to be sticky so everyone can find it. 

On Fri, Jun 8, 2018 at 10:09 PM ron minnich <rmin...@gmail.com> wrote:

as regards something like this, I've never seen a convincing argument re performance that we need to tag data with endian attributes. 

And I'm mentioned as one of the guys who pushed such a bad idea in the now-withdrawn https://standards.ieee.org/findstds/standard/1596.5-1993.html, so in my dark past, I even believed in this kind of thing. Oops.

We did a test a few years back and as of gcc 6, it's pretty smart about turning certain sequences of byte access into single word load/store.  

As regards most code that thinks it needs to be endian-aware, this particular note is useful:

I've found that Rob's note is correct far more often than not. 

i like it!

but it only discusses encoded data streams.

one case it didn’t discuss is how to access peripherals which have endian issues. e.g., a 24b DAC with a control register (part of an IP block) using a different endianness than the host cpu (a different IP block). these aren’t “data streams”, but require loads and stores to do the right thing.
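
a rough sketch of that case, as it has to be written today on a little-endian core (illustrative only; the register address and names are made up):

#include <stdint.h>

#define DAC_CTRL_ADDR 0x40001000u   /* hypothetical big-endian control register */

static inline uint32_t dac_ctrl_read(void)
{
    volatile uint32_t *reg = (volatile uint32_t *)DAC_CTRL_ADDR;
    return __builtin_bswap32(*reg);   /* swap after the MMIO load */
}

static inline void dac_ctrl_write(uint32_t v)
{
    volatile uint32_t *reg = (volatile uint32_t *)DAC_CTRL_ADDR;
    *reg = __builtin_bswap32(v);      /* swap before the MMIO store */
}

a big-endian load/store (or a fused load+swap) would do the same thing in a single instruction.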

guy

Madhu

Jun 9, 2018, 11:42:53 AM
to Guy Lemieux, ron minnich, Akira Tsukamoto, Allen Baum, Luke Kenneth Casson Leighton, Oleg Endo, RISC-V ISA Dev, Shumpei Kawasaki, jcb6...@gmail.com
In some cases for these kinds of peripherals, it is simpler to add
some interface logic
to the IP block to do the conversion. We just converted an 80s control system
to RISC-V and had to do this for the peripherals.

Even in networking (especially PPC-based), accelerators often do the low-level packet manipulation, and it is only control and exception packets that come to the core. Swap support will suffice for this. Our team is probably the most affected by this, since we have to convert a whole host of legacy systems to RISC-V, but we do not yet see any need for anything more than swap instructions.

In general, do not worship at the altar of legacy support!



--
Regards,
Madhu

Michael Clark

Jun 9, 2018, 11:48:09 AM
to Andrew Waterman, Luke Kenneth Casson Leighton, Samuel Falvo II, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev, Shumpei Kawasaki


On 9/06/2018, at 9:54 PM, Andrew Waterman <and...@sifive.com> wrote:

On Sat, Jun 9, 2018 at 2:45 AM, Luke Kenneth Casson Leighton
<lk...@lkcl.net> wrote:
On Sat, Jun 9, 2018 at 10:30 AM, Andrew Waterman <and...@sifive.com> wrote:

My point is that there will be such a hit, no matter which approach is
taken.  There's effectively no room in RVC to encode new big-endian
loads and stores.  There's effectively no room in RVI to encode new
big-endian loads and stores with 12-bit offsets.  So, you're left
either wider instructions or two-instruction sequences.

there is another option: the conflict-resolution scheme.  it was
discussed a couple months back, and is "effectively" as if 32-bit (or
other sized) opcodes had been extended (by some hidden bits that are
set with a CSR).

using that scheme the actual meaning of existing opcodes may be
"redirected" to a completely different execution engine, *without*
impact on the pipeline speed or introducing extra latency [1], and,
crucially, allowing the processor to be switched back to "standard"
meanings very very quickly.

I agree that extending the opcode by a few bits will not materially
exacerbate decode latency.

But this issue isn't anywhere near important enough to merit such an
elaborate strategy.  Either Sam's or my/Tommy's solution is
sufficient.

We need a simple BSWAP and it’s quite important based on the amount of code required to swap a 32-bit or 64-bit word on RISC-V presently.

Load/store instructions are a nice-to-have, but the relative improvement over a BSWAP instruction is minor.

Compiler attributes are *incredibly* hard to implement. GCC has __attribute__((scalar_storage_order("big-endian"))) but there are all sorts of restrictions due to various complexities, such as [what if I take the address of a pointer to a word that is of this endianness and pass it to a function that takes a pointer to a word]. Someone from Intel wrote about implementing bi-endian support in ICC on the LLVM mailing list and the conclusion was “don’t do it”.

Most of the use cases for “portable code” are covered by having a fast instruction for the built-ins, i.e. __builtin_bswap16, __builtin_bswap32 and __builtin_bswap64 [noting that bswap32 will be very frequent on RV64 in both crypto and network code]. The remainder is supported by idiomatic lifting of swap patterns, i.e. the compiler can detect open-coded swaps and lift them into dedicated instructions: https://cx.rv8.io/g/ucLL1v
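
To illustrate the idiomatic lifting (a sketch only; the function names are mine):

#include <stdint.h>

/* open-coded 32-bit byte swap that a compiler can recognise... */
static uint32_t swap32_open_coded(uint32_t x)
{
    return  (x >> 24)
         | ((x >>  8) & 0x0000ff00u)
         | ((x <<  8) & 0x00ff0000u)
         |  (x << 24);
}

/* ...and compile to the same code as the builtin, which a single
   BSWAP/GREVI-style instruction would then back directly */
static uint32_t swap32_builtin(uint32_t x)
{
    return __builtin_bswap32(x);
}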

I would stress that I don’t think we should penalise 32-bit swaps on RV64. We already have some cases where the compiler doesn’t work well with 32-bit types such as int. I’m also still seeing lots of redundant sign extensions and missed shift coalescing opportunities (from sign or zero extension expansion that happen after shift coalescing passes) from the current GCC versions.

We also need __builtin_clz(ll), __builtin_ctz(ll), __builtin_popcount(ll) and rotates.

The rest of the more obscure bit manipulation stuff is quite powerful and very interesting, but there is simply no large base of code that uses or will benefit from it. I’m fine with BSWAP being implemented as GREVI, but if I had to choose between that or losing CTZ, I’d favour CTZ, simply because I can find a lot more code in the wild that actually uses CTZ and very little that uses GREVI (despite all of the theoretical uses it has). Bit reversal doesn’t show up in typical code either, and ANDC is essentially trying to extend the Base ISA, i.e. it is 2 instructions and could be macro-op fused.

ron minnich

Jun 9, 2018, 11:53:24 AM
to Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
On Sat, Jun 9, 2018 at 8:42 AM Madhu <ma...@macaque.in> wrote:

In general do not worship at the altar of legacy support !


Yes. In way too many cases (the point I was trying to make with my note, driven from Rob's note) the endian issue is something people worry about that's almost always not worth worrying about, that can be easily addressed without modifying compilers and adding modes to CPUs.  I've seen (fixed) so much code just by removing all the broken attempts to deal with endianness, and the fix is almost always to make it NON-endian aware.

e.g., if you literally have this:
char *a;
uint32_t b;
b = a[0] | a[1]<<8 | a[2] << 16 | a[3] << 24;

because you are doing some kind of endian conversion, we've seen that the compilers are so smart now they'll turn that into a word load if the endianness allows it. I expect the compilers are reasonably smart if you give them the kind of instructions that Andrew mentioned. I'm just not convinced that we need to extend the compiler to allow endianness tags, and further add a bunch of extensions for bi-endianness.

It took me too long but I finally realized it by the end of the 90s.

So do people have the hard numbers, driven from measurement, that show this is a big problem? Or just seems to be a big problem?

ron

Samuel Falvo II

Jun 9, 2018, 12:34:17 PM
to Gavin Stark, RISC-V ISA Dev, Andrew Waterman, Allen Baum, Jacob Bachmeyer, Shumpei Kawasaki
On Sat, Jun 9, 2018 at 4:39 AM, Gavin Stark <atthec...@gmail.com> wrote:
> For the data side I don't believe there is an endianness issue in the ISA,
> with one caveat: section 2.6 states ' RV32I provides a 32-bit user address
> space that is byte-addressed and little-endian.' Without this statement
> RISC-V would be biendian; it is an unnecessary restriction.

If I understood your message correctly, I think this turns out to be a
necessary restriction if you wish to reconcile how to lay out
instructions in a Von Neumann architecture machine. It is also a
requirement if you stipulate binary compatibility across RISC-V
implementations.

> So the statement in section 2.6 is relevant - but it is effectively a
> statement about the platform, not the ISA.

Here's a great reason why section 2.6 makes perfect sense for
belonging in the ISA. Observe that all units are naturally aligned,
and even type-safe (e.g., 32-bit words are *only* accessed with LW/SW,
etc.).

I can write a compiler that translates C into RISC-V assembly
language. C lacks any explicit keywords for specifying big- or
little-endian numbers; a long is a long, and an int is an int, etc.
These correspond to the natural representation of these in memory,
where natural is often taken to mean most run-time efficient, or put
another way, requiring the least amount of instructions to manipulate.
Naturally, that's exactly the kind of code my C compiler will produce.

This compiler can be made completely portable across big- and
little-endian RISC-V variants. And, provided software built from this
compiler runs *only* on the same platform on which the compiler itself
runs, you'll never notice any difference between a big- and
little-endian processor. For all intents and purposes, RISC-V is thus
a "portable" architecture.

The issue comes when I want to run my software (built on a
little-endian RISC-V) on your computer (a big-endian RISC-V
processor). In the best case scenario, the processor will throw an
illegal instruction trap on the very first instruction it looks at,
because my program's instruction layout will be different from what
your hardware expects. In the worst case scenario, my generated code
will have an instruction which just *happens* to form an
unintended-but-valid instruction encoding for your processor. This
possibility guarantees that even software emulation through repeated
illegal instruction traps is not a viable solution, and will lead to
wrong code being executed.

This violates the requirement that all RISC-V implementations support
the unprivileged instruction set corresponding to its XLEN. Ergo, a
firm decision on endianness *is* required to establish compatibility
guarantees between different implementations of the ISA, regardless of
specific platform the ISA is used with.

> * To add big-endian load/store instructions one has to be explicit about
> what this means - is it fully big-endian (128-bit bigendian, 64-bit
> big-endian, 32-bit big-endian, 16-bit big endian (!) etc). This scares me in
> the sense that it (as Andrew has said) is a lot of instructions to add. It

To be clear, you're almost certainly going to need multiple swap
instructions too, especially so as to minimize the overhead of fusion
logic in the instruction decoder; HSWAP to swap the lower bytes of a
halfword, WSWAP to swap the lowest four bytes of a 32-bit word, DSWAP
for the lowest 8-bytes of a 64-bit dword, and so forth.

Allen Baum

Jun 9, 2018, 12:57:29 PM
to Samuel Falvo II, Gavin Stark, RISC-V ISA Dev, Andrew Waterman, Jacob Bachmeyer, Shumpei Kawasaki
The cases you outline are not the problematic ones. The nastiness is dealing with data outside the platform, e.g. network traffic that comes in a specific endian mode that your chip has no control over and isn’t aligned, or data structures that aren’t aligned.
It’s ugly legacy code and protocols - the kind that are deeply entrenched and that you don’t get to modify.

-Allen

Samuel Falvo II

Jun 9, 2018, 1:48:56 PM
to Allen Baum, Gavin Stark, RISC-V ISA Dev, Andrew Waterman, Jacob Bachmeyer, Shumpei Kawasaki
On Sat, Jun 9, 2018 at 9:57 AM, Allen Baum <allen...@esperantotech.com> wrote:
> The cases you outline are not the problematic ones. The nastiness is dealing with data outside the platform, e.g. network traffic that comes in a specific endian mode that your chip has no control over and isn’t aligned, or data structures that aren’t aligned.

I fail to see how my example fails to meet your criteria. It's almost
the poster example. A binary compiled on little-endian chip A that
claims to be RISC-V compatible being made to work on a big-endian chip
B also claiming to be RISC-V compatible.

Jim Wilson

Jun 9, 2018, 4:33:18 PM
to Michael Clark, Andrew Waterman, Luke Kenneth Casson Leighton, Samuel Falvo II, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev, Shumpei Kawasaki
On Sat, Jun 9, 2018 at 8:47 AM, Michael Clark <michae...@mac.com> wrote:
> Compiler attributes are *incredibly* hard to implement. GCC has
> __attribute__(( scalar_storage(“big-endian”)) but there are all sorts of
> restrictions due to various complexities such as [what if I take the address
> of a pointer to a word that is of this endianness and pass it to a function
> that takes a pointer to a word]. Someone from Intel wrote about implementing
> bi-endian support in ICC on the LLVM mailing list and the conclusion was
> “don’t do it”.

Intel wrote their bi-endian ICC support for Cisco. I was at Cisco at
the time. It took 5 years of work from a team of people, both ICC
developers and Cisco IOS developers, to make this work. It is pretty
amazing that they got it working, that they can compile a 100M line
big-endian code base for little endian x86 and it works. But it was a
tremendous amount of work, and was a nightmare to maintain, both in
ICC and in the Cisco IOS code. Glibc and the kernel of course are
still little-endian, so everything that travels between the app and
glibc/kernel needs to be byte swapped. For good performance they have
to optimize all of the endian stuff, avoiding byte swaps when
possible. The result is that any variable may be either big or little
endian, and may be different endiannesses at different places in the
code, and sometimes is both endiannesses at the same time stored in
two places. Sometimes we'd get a bug report and we would have to
study it for a week before we could figure out if it was a compiler
bug or an application bug. And there is a lot of stuff that will
never work right in a bi-endian compiler. FP can't easily be byte
swapped. Functions with va_list arguments like vfprintf can't be byte
swapped because we don't know the arg types and sizes. The standard
C++ name mangling scheme doesn't handle endianness, so you can't
easily mix big and little endian C++ code with templates, at least not
without an ABI breaking name mangling change. Etc. All of that stuff
had to be worked around in the Cisco IOS code base. Intel of course
patented all of the good ideas that they came up with while implementing
this in ICC, so doing the same in another compiler will be difficult
without violating Intel patents.

Another anecdote: PPC is big endian by default, but about 15 years ago
IBM published a new 64-bit little endian PPC ABI. Why, you ask?
Because Google started with x86 processors, and decided that they
wanted to add PPC support also, but couldn't figure out how to fix
their code, and so they convinced IBM to start shipping little endian
PPC systems instead because it was easier. If Google can't figure out
how to make their code endian neutral, then there isn't much hope that
other companies will be able to do this.

Of course both the Cisco and Google code can be fixed. At Cisco, it
took several years to convince management to let me try as a part time
project, and I managed to get BGP working with about 6 weeks of work
over a 3 month period. That required getting the entire IP stack
working first. But then they took the project away from me and gave
it to people who claimed that they had a better idea. They then
failed miserably because they didn't quite understand what they were
doing, and so Cisco killed the project. All along, there were people
who were afraid that trying to fix the code would break it, that was a
major hurdle I never quite got over. Another big part of the problem
is lack of long term planning. I told them it was a 5 year project at
the start, and they said they couldn't wait 5 years for a solution,
and that they didn't think it should take 5 years to fix. But 5 years
later they were still looking for a solution, because every attempt to
solve it other than the ICC bi-endian compiler project had failed.
And that one worked only because they gave it the 5 years it needed to
succeed.

Anyways, if you want serious penetration into the big-endian market,
you are going to need a bi-endian processor. You can leave code
little-endian like ARM and just swap the data accesses.

Jim

Jim Wilson

unread,
Jun 9, 2018, 4:53:12 PM6/9/18
to ron minnich, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
On Sat, Jun 9, 2018 at 8:53 AM, ron minnich <rmin...@gmail.com> wrote:
> e.g., if you literally have this:
> char *a;
> uint32_t b;
> b = a[0] | a[1]<<8 | a[2] << 16 | a[3] << 24;

In real world code it is a lot more complicated than this. In the
networking code I was working on at Cisco, it was written to minimize
the memory usage, and to minimize data copying to improve performance.
So there was a lot of pointer type casting and unions so that data in
a network packet could be parsed and used in place without a copy.
This works fine on a big-endian processor because IP network packets
are sent in big-endian order. But on a little endian processor this
requires some careful byte swapping and careful redefinition of all of
the structures and unions. There were also some internal data
structures that were a challenge. They had code to store IPv4
addresses in a variable size structure, storing only the prefix,
because depending on the network class you only needed 1, 2, or 3
bytes. This worked fine on a big-endian processor, but little-endian
got complicated because there was no easy way to figure out the number
of bytes to read, since decoding the network class required access to
the high-order big-endian byte first. I had to byte swap IPv4
addresses to big-endian before storing them in this data structure to
make it work. There were a lot of different problems I had to find
solutions for while working on the IOS code.
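
For illustration, a minimal sketch of the in-place parsing pattern described above; the header layout and all names here are hypothetical, not taken from IOS:

#include <stdint.h>
#include <arpa/inet.h>          /* ntohl() */

struct hdr {                    /* hypothetical wire-format header, all fields big-endian */
    uint32_t src_addr;
    uint32_t seq;
};

/* In-place parsing: no copy, but the raw value is only correct on a
   big-endian host (and assumes a suitably aligned buffer). */
static uint32_t seq_inplace(const void *pkt)
{
    const struct hdr *h = (const struct hdr *)pkt;
    return h->seq;
}

/* Portable variant: convert from network (big-endian) byte order. */
static uint32_t seq_portable(const void *pkt)
{
    const struct hdr *h = (const struct hdr *)pkt;
    return ntohl(h->seq);
}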

Jim

Shumpei Kawasaki

unread,
Jun 9, 2018, 6:00:56 PM6/9/18
to Jim Wilson, Michael Clark, Andrew Waterman, Luke Kenneth Casson Leighton, Samuel Falvo II, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev

We find this thread initiated by Allen very timely. 

RISC-V bi-endianness would offer: (1) automagical OS and app porting from one arch to another, and (2) potential performance gains in network-specific code (e.g. 5x even after SWAP instructions). A significant amount of cross-tool work lies ahead in bringing bi-endianness to RISC-V. Compiler engineers checked whether someone is working on big-endian support for GCC; it looks like there is nothing for RISC-V big-endian presently. The first thing is to figure out the ELF format and work with the community. We estimate 1,000 hours of software and hardware engineering work to perform a RISC-V bi-endianness feasibility study. 

Japan used to ship 80% of "embedded systems"; that is now down to 20% or less. Japan's infrastructure code includes automatic train control (ATC) for bullet trains and municipal trains, control code for nuclear power plants, and JR's railroad network system, which predates Ethernet. This code is 100% big-endian. Porting the legacy code over to RISC-V will require due diligence. 

Japanese government agencies consider RISC-V a key to upgrading their social infrastructure. The Japanese cabinet has a program, Society 5.0, to improve QOL (quality of life) for both incoming immigrants and the aging population through deploying cyber-physical systems. Hitachi became a RISC-V member two weeks ago and plans to disclose their specific Society 5.0 programs on June 13. 

We would like to know everything about RISC-V bi-endianness by October. 

This discussion is valuable.

Luke Kenneth Casson Leighton

unread,
Jun 9, 2018, 6:12:25 PM6/9/18
to Michael Clark, Andrew Waterman, Samuel Falvo II, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev, Shumpei Kawasaki
On Sat, Jun 9, 2018 at 4:47 PM, Michael Clark <michae...@mac.com> wrote:

> Most of the use cases for “portable code” are covered by having a fast
> instruction for the built ins i.e. __builtin_bswap16, __builtin_bswap32 and
> __builtin_bswap64 [noting that bswap32 will be very frequent on RV64 in both
> crypto and network code].

> [...]

> We also need __builtin_clz(ll), __builtin_ctz(ll), __builtin_popcount(ll)
> and rotates.

these and GREVI are all in xBitManip, by clifford wolf, it's
extremely well-designed and well thought-through:
https://github.com/cliffordwolf/xbitmanip

the mailing list for discussion, contributions and questions is here:
https://groups.google.com/forum/#!forum/riscv-xbitmanip

l.

ron minnich

unread,
Jun 9, 2018, 6:56:35 PM6/9/18
to Jim Wilson, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
On Sat, Jun 9, 2018 at 1:53 PM Jim Wilson <ji...@sifive.com> wrote:
On Sat, Jun 9, 2018 at 8:53 AM, ron minnich <rmin...@gmail.com> wrote:
> e.g., if you literally have this:
> char *a;
> uint32_t b;
> b = a[0] | a[1]<<8 | a[2] << 16 | a[3] << 24;

In real world code it is a lot more complicated than this. 

The entire plan 9 ip stack was written to rules like this and it lived for many years in the real world. The word endian almost never appears, and in fact I just checked, and it appears not once in the IP stack. 

It takes a lot of care to write code that is endian-independent without sprinkling #ifdef and other such nasty stuff all over the place, but Plan 9 is proof of concept that it's doable, and it's been running for 20 years in many appliances near you. And, as it happens, Go picked these ideas up, and it's real world code, and people are learning these lessons. I also brought this model into parts of coreboot, running in about 30M chromebooks near you, and it works there too. It's a real effort to stamp out all instances of #ifdef LITTLE_ENDIAN but we're getting there. And some of the worst bugs I've fixed have been people messing up things like ntohl and htonl -- one of them was there for 6 years and nobody realized it.
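
A minimal sketch of what such rules look like in practice (helper names are illustrative): the byte order of the data is spelled out once, in the accessors, so the same code compiles and runs identically on little- and big-endian hosts.

#include <stdint.h>

/* read/write a 32-bit little-endian quantity from/to a byte buffer */
static uint32_t get32le(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static void put32le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}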

So, I can accept that real world code may be more complicated, but I'm not willing to accept that in all cases it should be.  Frequently, code with endian-awareness bits in it is badly written, and that includes almost every bit of code I've ever seen with the words BIG_ENDIAN or LITTLE_ENDIAN in it :-)

Which, to close the loop, means I'd still like to hear the quantitative argument for adding all this stuff to gcc and riscv. Has the need been measured or are we still flying on anecdote?

ron

Jim Wilson

unread,
Jun 9, 2018, 7:45:57 PM6/9/18
to ron minnich, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
On Sat, Jun 9, 2018 at 3:56 PM, ron minnich <rmin...@gmail.com> wrote:
> The entire plan 9 ip stack was written to rules like this and it lived for
> many years in the real world. The word endian almost never appears, and in
> fact I just checked, and it appears not once in the IP stack.

It is certainly possible to write endian neutral code from the start.
I never said otherwise. Most GNU code is endian neutral. But the
reality is that few people do that. As a compiler guy, I've seen a
lot of real world code, and most of it is badly written.

Plan 9 isn't representative of real world code. It was written in a
research lab, Bell Labs, by people who had already done significant OS
and compiler work. Most real world code is written by people that are
only just competent enough to do their jobs, while working against too
short deadlines that don't give them enough time to do the work the
best way possible. So you end up with lots of endianness problems.
And by the time they realize that they have a problem, they have a
large code base that has already been production qualified, and they
can't afford to rewrite it to make it endian neutral.

That is why people think the easiest solution is to "fix" the
compiler, or to fix the hardware. Fixing the compiler is a crazy
idea, though to be fair, in Intel's case, considering how complicated
the x86 ISA is, fixing the compiler might have been the easiest
solution for them.

Jim

Jacob Bachmeyer

unread,
Jun 9, 2018, 8:34:54 PM6/9/18
to ron minnich, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
That code effectively *is* an endianness tag, just obfuscated a bit to
fit in standard C. Further, having word-access instructions for both
byte orders would mean that GCC can *always* convert those expressions
to word loads if they actually encode a word load, rather than
generating the full sequence of N byte loads, N-1 shifts, and N-1
logical OR if the processor has the "other" endianness. On the other
hand, *supressing* this conversion to word load and forcing the
long-form sequence to be generated may also be useful in RISC-V
networking, since words in a network packet may often be unaligned and
RISC-V hardware may trap on unaligned word accesses.
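
A related sketch (a standard C idiom, not part of the proposal above): going through memcpy leaves the choice to the compiler, which can emit a single word load where the target allows unaligned access and fall back to byte accesses otherwise.

#include <stdint.h>
#include <string.h>

/* load a 32-bit value in host byte order without assuming alignment */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}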


-- Jacob

Jacob Bachmeyer

unread,
Jun 9, 2018, 8:55:34 PM6/9/18
to Luke Kenneth Casson Leighton, Andrew Waterman, Samuel Falvo II, Allen Baum, RISC-V ISA Dev, Shumpei Kawasaki
The extensible assembler database I proposed assumes that each processor
will have *one* and *only* one set of recognized instructions. (The
"hidden prefix" is the immutable vendor/arch/impl tuple in my
proposals.) In other words, it moves the conflict-resolution to the
RISC-V ecosystem as a whole, allowing different implementations to have
different (possibly overlapping) non-standard extensions, while avoiding
"hidden state" in the instruction decoder. (Which simplifies analysis
of RISC-V binaries. If (as we all want) RISC-V becomes popular, there
*will* eventually be RISC-V malware that will need to be
reverse-engineered. Multiple interpretations (as arranged for by the
"conflict resolution" proposals) will only complicate this and possibly
delay discovery of such malware. Consider a worm that targets some kind
of IoT device with particular unusual extensions and uses those to avoid
executing in honeypots that do not have them. Crashes when such a worm
infests a non-targeted device are much more obvious than "oh, this
processor does not support instruction set XYZ, so it is not the IoT
device I seek; lay low and spread more".)

If big-endian memory access is standardized as LOAD/GREV and GREV/STORE
fusion pairs, represented with assembler pseudo-instructions, then the
extensible assembler database I propose would permit those
pseudo-instructions to be overridden for a specific implementation that
might, for example, choose to use CUSTOM-0 and CUSTOM-1 as big-endian
LOAD and STORE. The compiler simply generates "LW.BE" and the assembler
either expands that to a LOAD/GREV pair or an implementation-specific
encoding, which may be a non-standard 32-bit opcode.


-- Jacob

Madhu

unread,
Jun 9, 2018, 9:19:50 PM6/9/18
to Jim Wilson, ron minnich, Guy Lemieux, Allen Baum, RISC-V ISA Dev
I agree you will end up fixing the HW, but you cannot get away without
GCC support. The question is what degree of HW change we can live with.
In all our control-system use cases, I doubt whether we will be
allowed to fix the code. The first design we completed was the control
system of a fast breeder reactor, so they were a little leery of even
switching to RISC-V from a 68020! But the code was relatively clean
and the verification infrastructure was of course pretty good.

The issues that arise will not be technical in systems like these; the
systems are too critical (next in line for us is launch vehicles) for
any code changes to be contemplated. Verification infrastructure may
not always be in place. In those cases, without some semblance of
HW/SW bi-endian support, we will face political problems like the ones
Shumpei pointed out.

My primary concern is the effect of bi-endian support in our high
perf. OO cores.
As we discovered in our last tapeout even at 500 Mhz on
a 22 nm, trivial little additions can make timing closure difficult.
But I guess there we could just support LE. Presumably
bi-endian will be optional.
--
Regards,
Madhu

ron minnich

unread,
Jun 9, 2018, 10:04:09 PM6/9/18
to jcb6...@gmail.com, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
On Sat, Jun 9, 2018 at 5:34 PM Jacob Bachmeyer <jcb6...@gmail.com> wrote:


That code effectively *is* an endianness tag, just obfuscated a bit to
fit in standard C. 

actually, it's not: it's the same code without regard to endian-ness. 

I'll stop here, I realize this is an argument that won't reach closure; I used to believe this kind of tag would be useful, but that beautiful theory was murdered by a brutal gang of facts :-)

but I'd still like to see the measurements that justify the work.

ron
 

Luke Kenneth Casson Leighton

unread,
Jun 9, 2018, 10:20:14 PM6/9/18
to Jacob Bachmeyer, Andrew Waterman, Samuel Falvo II, Allen Baum, RISC-V ISA Dev, Shumpei Kawasaki
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68


On Sun, Jun 10, 2018 at 1:55 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Luke Kenneth Casson Leighton wrote:
>>
>> On Sat, Jun 9, 2018 at 10:45 AM, Luke Kenneth Casson Leighton
>> <lk...@lkcl.net> wrote:
>>
>>>
>>> On Sat, Jun 9, 2018 at 10:30 AM, Andrew Waterman <and...@sifive.com>
>>> wrote:
>>>
>>>
>>>>
>>>> My point is that there will be such a hit, no matter which approach is
>>>> taken. There's effectively no room in RVC to encode new big-endian
>>>> loads and stores. There's effectively no room in RVI to encode new
>>>> big-endian loads and stores with 12-bit offsets. So, you're left
>>>> either wider instructions or two-instruction sequences.
>>>>
>>>
>>> there is another option: the conflict-resolution scheme. it was
>>> discussed a couple months back, and is "effectively" as if 32-bit (or
>>> other sized) opcodes had been extended (by some hidden bits that are
>>> set with a CSR).
>>>
>>
>>
>> p.s. jacob already came up with a corresponding / matching scheme for
>> compilers / binutils, which takes the hidden prefix into account and
>> walks it through from gcc to binutils to actual assembler.
>>
>
>
> The extensible assembler database I proposed assumes that each processor
> will have *one* and *only* one set of recognized instructions. (The "hidden
> prefix" is the immutable vendor/arch/impl tuple in my proposals.)

ah this is an extremely important thing to clarify, the difference
between the recognised instruction assembly mnemonic (which must be
globally world-wide accepted as canonical) and the binary-level
encodings of that mnemonic used by different vendor implementations, which
will most definitely *not* be unique but require "registration" in the
form of atomic acceptance of a patch (to the extensible assembler
database that jacob mentions) by the FSF to gcc and binutils [and
other compiler tools].

l.

Jacob Bachmeyer

unread,
Jun 9, 2018, 10:27:04 PM6/9/18
to ron minnich, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
ron minnich wrote:
> On Sat, Jun 9, 2018 at 5:34 PM Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
> That code effectively *is* an endianness tag, just obfuscated a
> bit to
> fit in standard C.
>
>
> actually, it's not: it's the same code without regard to endian-ness.

It is the same code regardless of processor byte order, yes, but it very
specifically tags (the combination of array offsets and shift amounts)
the data byte order, which is *why* it works regardless of processor
endianness. I propose a similar explicit tag applicable to structure types.

In other words, I argue that byte order can be thought of as a component
of type information associated with a value, not unlike the width of
that value. Put yet another way, be_uint32_t and le_uint32_t can be
considered separate types, although systems in practice use one as
"uint32_t" and have no direct support for the other at all. For
little-endian processors, this is sub-optimal for networking, since
"network byte order" is big-endian. (And there are cases where this is
objectively correct: big-endian base-128 varints can be read with the
same amount of effort regardless of processor endianness (or width!) in
a simple loop. The little-endian equivalents are ... much more
difficult to process portably.)
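
A sketch of the big-endian base-128 case mentioned above (the encoding details here are illustrative): because the most-significant groups arrive first, a single shift-and-or loop works on any host, whatever its width or byte order.

#include <stddef.h>
#include <stdint.h>

/* decode a big-endian base-128 varint: 7 value bits per byte,
   high bit set on every byte except the last */
static uint64_t read_be_varint(const uint8_t *p, size_t *len)
{
    uint64_t v = 0;
    size_t i = 0;
    do {
        v = (v << 7) | (p[i] & 0x7f);
    } while (p[i++] & 0x80);
    *len = i;
    return v;
}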


-- Jacob

Luke Kenneth Casson Leighton

unread,
Jun 9, 2018, 10:42:39 PM6/9/18
to Madhu, Jim Wilson, ron minnich, Guy Lemieux, Allen Baum, RISC-V ISA Dev
On Sun, Jun 10, 2018 at 2:19 AM, Madhu <ma...@macaque.in> wrote:

> In all our control systems use cases, I doubt whether we will be
> allowed to fix the code.
> The first design we completed was the control system of a fast breeder reactor,
> so they were a little leery on even switching to RISC-V from a 68020
> ! But code was
> relatively clean and the verification infrastructure was of course pretty good.

ye gods! not very many of those in the world: you'd not be selling
huge numbers of RISC-V processors there but... dang. so, um, let me
put it this way: normally the argument would go "well that's a tiny
use-case, not many of those so honestly we really don't care" well um
actually in this case we *do* care because if a nuclear reactor goes
"bang" because of a BE/LE screw-up it kiiinda affects the entire
planet.

> The issues that arise will not be technical in systems like these, the
> systems are
> too critical (next in line for us is launch vehicles) for any code changes to be
> contemplated. Verification infrastructure may not always in place. In those
> cases, without some semblance of HW/SW bi-endian support, we will face
> political problems
> like Shumpei pointed out.

so these are insanely-critical use-cases with very low numbers, where
mistakes simply cannot be made. i learned the lesson from
reverse-engineering that you *do not* make more than one change at a
time, because if you do, the number of "unknowns" in the question "ok,
which change broke things?" rises from just 2 to 2^N, where N is the
number of changes.

so the usual argument "you can just change the code" is, in these
mission-critical environments, absolutely flat-out "no you absolutely
cannot". the bug might be in the hardware, the bug might be in the
software, and the nightmare scenario for security and formal
verification is when you *DON'T* know that things are broken.

*AFTER* a change to RISC-V bi-endian, *THEN* after maybe 20 years of
verification of the software it *MIGHT* be possible to switch from a
BE software model to a LE software model...

... but you have a transition here where you *need* to get away from
the BE HW of the 68020, so bi-endian is critical:

* swapping to another BE system you're still stuck
* staying on the existing BE system is a legacy race against time
* swapping to a LE system is not safe as that's a SW change *and* a HW change.

therefore swapping to bi-endian is essential.

question for you madhu: do there happen to be any *other* use-cases -
larger volumes - where bi-endian RISC-V HW could be tested out and
verified *before* deployment into these mission-critical environments?
a piece of factory equipment going "bang" is a lot less damage than a
rocket or a nuclear reactor going "bang" ;)

l.

Alex Elsayed

unread,
Jun 9, 2018, 11:08:17 PM6/9/18
to Jacob Bachmeyer, ron minnich, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
I look forward to the day that is implemented in the major compilers and standardized in ISO 9899:20XX, in whatever order that occurs, but that is a matter for those groups rather than RISC-V IMO.

No part of that proposal is specific to RISC-V, and in fact, it is specific to things such as C that do _not_ encompass the whole of RISC-V. What about Rust, and Haskell, and C++, and Java, and Kotlin, and Swift, and Go, and...?

Albert Cahalan

unread,
Jun 9, 2018, 11:57:18 PM6/9/18
to Jim Wilson, Michael Clark, Andrew Waterman, Luke Kenneth Casson Leighton, Samuel Falvo II, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev, Shumpei Kawasaki
On 6/9/18, Jim Wilson <ji...@sifive.com> wrote:

> And there is a lot of stuff that will
> never work right in a bi-endian compiler. FP can't easily be byte
> swapped. Functions with va_list arguments like vfprintf can't be byte
> swapped because we don't know the arg types and sizes. The standard
> C++ name mangling scheme doesn't handle endianness, so you can't
> easily mix big and little endian C++ code with templates, at least not
> without an ABI breaking name mangling change.

You're changing the ABI anyway, obviously, to big-endian.
You only had troubles with va_list because you strangely
insisted on passing that from one endianness to another.
It's not the compiler that is trouble. It's the fact that you
were trying to link two different ABIs into a single binary.

In other words, it was like the Win16 to Win32 thunks that
people had to write in the bad old days.

You asked for hurt, and you got it.

That said, passing a va_list from one endianness to another is
perfectly doable if your ABI specifies that va_list have the needed
information. The compiler can generate this, even making the
calling convention for "va_list" and "..." functions identical.
Such an ABI might have a va_list point to DWARF data. It'd have
the bonus of being able to detect all sorts of trouble at runtime.

Jim Wilson

unread,
Jun 10, 2018, 12:24:32 AM6/10/18
to Albert Cahalan, Michael Clark, Andrew Waterman, Luke Kenneth Casson Leighton, Samuel Falvo II, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev, Shumpei Kawasaki
On Sat, Jun 9, 2018 at 8:57 PM, Albert Cahalan <acah...@gmail.com> wrote:
> You're changing the ABI anyway, obviously, to big-endian.
> You only had troubles with va_list because you stangely
> insisted on passing that from one endianness to another.
> It's not the compiler that is trouble. It's the fact that you
> were trying to link two different ABIs into a single binary.

This is unavoidable. You can't have a big-endian kernel on an x86
system. You might be able to have a big-endian glibc, but it isn't
practical to try. So given that the app is big-endian and glibc is
little-endian, you have no choice but to have two ABI's in a single
binary. Since the big-endian support requires instrumenting code,
basically the only code that is big-endian is the Cisco IOS code. All
system libraries including glibc are little-endian, all third-party
libraries are little-endian, etc.

> That said, passing a va_list from one endianness to another is
> perfectly doable if your ABI specifies that va_list have the needed
> information. The compiler can generate this, even making the
> calling convention for "va_list" and "..." functions identical.
> Such an ABI might have a va_list point to DWARF data. It'd have
> the bonus of being able to detect all sorts of trouble at runtime.

You can't change va_list, because that breaks glibc. Not unless you
can get all linux distro vendors to agree to rebuild all of their code
using a new bi-endian-friendly x86 ABI, which is unlikely to happen.

Consider for instance how qsort works. A big-endian app calling a
little-endian qsort knows that it needs to byte swap arguments. But
then the little-endian qsort that Red
Hat/Suse/Ubuntu/Debian/MontaVista/whoever provided has to call the
comparison function hook provided by the big-endian app, and these
little-endian qsorts know nothing about big-endian code. The only way
this can work is if you put a little-endian comparison function in the
middle of your big-endian code, and pass that to qsort. This little
endian comparison function can then do the byte swapping necessary to
use big-endian data.
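
A sketch of that thunk (element type and names are illustrative): the callback handed to the little-endian qsort swaps each big-endian element to host order before comparing.

#include <stdint.h>
#include <stdlib.h>

static int cmp_be_u32(const void *pa, const void *pb)
{
    uint32_t a = __builtin_bswap32(*(const uint32_t *)pa);
    uint32_t b = __builtin_bswap32(*(const uint32_t *)pb);
    return (a > b) - (a < b);
}

/* usage: qsort(big_endian_array, n, sizeof(uint32_t), cmp_be_u32); */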

The important thing to remember here is that the bi-endian icc is
designed to work on any x86 system without change, so you can't avoid
little-endian code, and you can't change the little-endian ABI.

Jim

Jacob Bachmeyer

unread,
Jun 10, 2018, 12:24:33 AM6/10/18
to Alex Elsayed, ron minnich, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
Alex Elsayed wrote:
>
> On Sat, Jun 9, 2018, 19:27 Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
> ron minnich wrote:
> > On Sat, Jun 9, 2018 at 5:34 PM Jacob Bachmeyer
> <jcb6...@gmail.com <mailto:jcb6...@gmail.com>
GCC has a variety of processor-specific features; this kind of tagging
model has to start somewhere. Why not on RISC-V to support a
bi-endian-at-runtime model unique to RISC-V? The issue of such support
was raised on this list, so at least someone other than me considers it
important and relevant to RISC-V. I proposed a solution that involves
adding L?.BE and S?.BE pseudo-instructions (that can be actual
instructions on some implementations as a non-standard extension).
Adding these tags to C is therefore compiler support for those instructions.

Further, "be_uint32_t" and "le_uint32_t" were intended as examples to
illustrate the concept, not suggestions that those should be added to
C. Some form of endianness tagging can be added to any language; I
actually favor GCC type attributes "big_endian" and "little_endian", or
perhaps byte_order("msb-first") and byte_order("lsb-first"), which also
suggest encodings for less-common byte orders. These attributes could
be applied to structure types and affect the contained fields, which is
almost ideal for networking applications.


-- Jacob

Jim Wilson

unread,
Jun 10, 2018, 12:38:37 AM6/10/18
to Jacob Bachmeyer, Alex Elsayed, ron minnich, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
>> In other words, I argue that byte order can be thought of as a
>> component
>> of type information associated with a value, not unlike the width of
>> that value.

In bi-endian icc, it is a type qualifier, like const and volatile.

> Further, "be_uint32_t" and "le_uint32_t" were intended as examples to
> illustrate the concept, not suggestions that those should be added to C.
> Some form of endianness tagging can be added to any language; I actually
> favor GCC type attributes "big_endian" and "little_endian", or perhaps
> byte_order("msb-first") and byte_order("lsb-first"), which also suggest
> encodings for less-common byte orders. These attributes could be applied to
> structure types and affect the contained fields, which is almost ideal for
> networking applications.

GCC has scalar_storage_order, which is available as an attribute, a
pragma, and a command line option. Try for instance

struct foo
{
  int i;
} __attribute__ ((scalar_storage_order ("big-endian")));

struct foo bar;

int
sub (int i)
{
  bar.i = i;
}

I get this code:

sub:
        addi    sp,sp,-16
        sd      ra,8(sp)
        call    __bswapsi2
        lui     a5,%hi(bar)
        sw      a0,%lo(bar)(a5)
        ld      ra,8(sp)
        addi    sp,sp,16
        jr      ra

You will not get efficient code this way. If you want efficient code,
you need bi-endian aware optimization passes, and this gets very
complicated very quickly.

Jim

Jacob Bachmeyer

unread,
Jun 10, 2018, 12:42:31 AM6/10/18
to Luke Kenneth Casson Leighton, Andrew Waterman, Samuel Falvo II, Allen Baum, RISC-V ISA Dev, Shumpei Kawasaki
Neither mnemonics nor encodings are globally unique: each
implementation picks a subset with no overlaps that it will support.
Mnemonics cannot be globally unique: RVXfoov1 and RVXfoov2 are
logically distinct extensions, but could reasonably have overlapping
mnemonics. Non-standard encodings cannot be globally unique: there is
no authority that can allocate them. Where mnemonics overlap, programs
cannot simultaneously use both extensions. Where encodings overlap,
implementations cannot support both extensions (without renumbering one
or both, which the extensible assembler database facilitates).

Note that, in the example case of RVXfoov1 and RVXfoov2 with overlapping
mnemonics, while modules must be assembled for one or the other (this
adds a fourth element to the extensible assembler database key tuple:
vendor/arch/impl/profile), an implementation could support *both* by
giving the two versions disjoint encodings. Another option is to allow
programs to add prefixes to extension instruction mnemonics, akin to XML
tag namespacing, but this is probably a bit much, particularly since
mnemonic conflicts can be avoided on a per-module basis and I do not
believe that there is anything similar in GNU as for any other architecture.


-- Jacob

Jacob Bachmeyer

unread,
Jun 10, 2018, 12:54:36 AM6/10/18
to Jim Wilson, Alex Elsayed, ron minnich, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
Jim Wilson wrote:
>> Further, "be_uint32_t" and "le_uint32_t" were intended as examples to
>> illustrate the concept, not suggestions that those should be added to C.
>> Some form of endianness tagging can be added to any language; I actually
>> favor GCC type attributes "big_endian" and "little_endian", or perhaps
>> byte_order("msb-first") and byte_order("lsb-first"), which also suggest
>> encodings for less-common byte orders. These attributes could be applied to
>> structure types and affect the contained fields, which is almost ideal for
>> networking applications.
>>
>
> GCC has scalar_storage_order, which is available as an attribute, a
> pragma, and a command line option.

Excellent! The C support already exists for my proposal.


> Try for instance
>
> struct foo
> {
> int i;
> } __attribute__ ((scalar_storage_order ("big-endian")));
>
> struct foo bar;
>
> int
> sub (int i)
> {
> bar.i = i;
> }
>
> I get this code:
>
> sub:
> addi sp,sp,-16
> sd ra,8(sp)
> call __bswapsi2
> lui a5,%hi(bar)
> sw a0,%lo(bar)(a5)
> ld ra,8(sp)
> addi sp,sp,16
> jr ra
>
> You will not get efficient code this way. If you want efficient code,
> you need bi-endian aware optimization passes, and this gets very
> complicated very quickly.

Another option, and the proposal I advance to meet this issue, is to add
assembler pseudo-instructions for big-endian LOAD/STORE. While these
pseudo-instructions would require an RVB aligned with the current
RVXBitManip proposals (which are intended to eventually become an RVB
proposal), specifically GREVI for byte-swap, an implementation would be
free to directly implement them with non-standard encodings, which would
override the pseudo-instructions in the assembler. The compiler does
not care about this detail.

The resultant code would then be more like: (hand-written assembler)

sub:
        LUI   a5, %hi(bar)
        SW.BE t0, a0, %lo(bar)(a5)
        JR    ra

Unfortunately, the big-endian STORE needs a temporary that it can
clobber, but this does not affect macro-op fusion because STORE
otherwise does not write to the register file.


-- Jacob

Luke Kenneth Casson Leighton

unread,
Jun 10, 2018, 1:35:13 AM6/10/18
to Jacob Bachmeyer, Andrew Waterman, Samuel Falvo II, Allen Baum, RISC-V ISA Dev, Shumpei Kawasaki
On Sun, Jun 10, 2018 at 5:42 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> Neither mnemonics nor encodings are globally unique: each implementation
> picks a subset with no overlaps that it will support.

darn it, i've got a concept in my head that i want to express and i'm
not using the right words. again. can i ask you a favour, if i start
a new thread could you help me to clarify this properly, particularly
the assembly / etc. aspect as i have put in a (short) talk proposal
for chennai and i need to get it right.

l.

Luke Kenneth Casson Leighton

unread,
Jun 10, 2018, 1:43:01 AM6/10/18
to Jacob Bachmeyer, Jim Wilson, Alex Elsayed, ron minnich, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
On Sun, Jun 10, 2018 at 5:54 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> The resultant code would then be more like: (hand-written assembler)
>
> sub:
> LUI a5, %hi(bar)
> SW.BE t0, a0,%lo(bar)(a5)
> JR ra
>
> Unfortunately, the big-endian STORE needs a temporary that it can clobber,
> but this does not affect macro-op fusion because STORE otherwise does not
> write to the register file.

oo. ah. right. can i possibly clarify, as this is quite important:
is it a known guaranteed-defined characteristic of macro-op fusion
that the register utilised as an intermediate does NOT modify the main
register file? if it is, it doesn't quite sound right, because a
system *not* having macro-op fusion would act differently from one
that did, and that would be bad.

if it isn't then the instructions needed just got a little more
complicated (because a temporary register is needed). which is
probably why h/w support for BE instructions has traditionally been
added...

l.

Michael Clark

unread,
Jun 10, 2018, 5:16:42 AM6/10/18
to Jim Wilson, Andrew Waterman, Luke Kenneth Casson Leighton, Samuel Falvo II, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev, Shumpei Kawasaki
Agree.

The four points from my point of view:

1). Bi-endian attributes in the compiler are hard and best avoided

2). For better performance of endian-aware code that does swaps, we need BSWAP[W] instructions, and this is important on systems of either endianness.

3). A big-endian mode could be set much like MXL/SXL/UXL where loads and stores for a “big-endian process or OS” would be swapped but the instruction encoding would otherwise be the same, versus adding load and store instructions

4). Byte-swapped load and store instructions would take up a lot of encoding space, and perhaps BSWAP instructions could instead be macro-op fused pre-store and post-load.

BSWAP instructions and big-endian mode make sense to me. The BSWAP instructions are useful no matter which endianness and a per-mode switch using a MXL/SXL/UXL mechanism would be similar to other processors that switch modes for loads and stores globally. While Linux doesn’t support multiple endian ABIs in the same kernel (it only supports endianness at compile time) it would be possible with sufficient kernel mods to swap pointers and data with endian specific formats such as network socket addresses. That said, latest PowerPC Linux using the ELFv2 ABI is little-endian and it is only older PowerPC systems that are big-endian. i.e. the Summit PowerPC supercomputer running RHEL is little-endian.

A large proportion of the Linux network stack, and OpenSSL and many other open source apps are endian aware and they either use __builtin_bswap(ll) or their swap macro idioms are lifted into BSWAP instructions.
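
A sketch of the two patterns (illustrative): an explicit builtin, and a hand-written swap idiom that current GCC/Clang can usually recognise and lower to a single byte-swap operation where the target has one.

#include <stdint.h>

static uint32_t swap_builtin(uint32_t x)
{
    return __builtin_bswap32(x);
}

static uint32_t swap_idiom(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}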

Linux kernel, glibc, musl-libc, OpenSSL, QEMU, and many other open source packages support ppcbe. I remember running Linux on big-endian ppc macs before IBM came out with bi-endian support, and Darwin kernel supported both big and little endian hosts while it still supported PowerPC. Indeed the riscv-qemu port should theoretically run on big-endian systems but it might need some minor modifications.

A big-endian MXL/UXL/SXL type mode would help with adoption for apps that rely on an endianness but are not written in an endian portable manner. i.e. expect loads and stores to be big-endian. Easier than adding loads and stores as it doesn’t require any Base ISA changes, only some privileged ISA mode bits.

Michael Clark

unread,
Jun 10, 2018, 6:46:16 AM6/10/18
to jcb6...@gmail.com, Luke Kenneth Casson Leighton, Andrew Waterman, Samuel Falvo II, Allen Baum, RISC-V ISA Dev, Shumpei Kawasaki
And binary compatibility goes out the window.

For the Unix platform the reality is that a bswap instruction will likely initially have to be used via optimised versions of the bswapsi and bswapdi functions in libgcc_s.so selected via hwcaps, or in optimised versions of libraries such as libssl.so that are replaced at runtime by the dynamic linker based on misa (RDISA). The latter, a hwcap substitution of a larger shared library that uses the optimised extension instructions, will perform better, as the PLT overhead removes a lot of the benefit of having a dedicated instruction in the libgcc_s.so case (much like the softfloat ABI using hardfloat instructions on arm).

That was also the case for SSE-optimised library functions on i386 Linux distros until x86_64 came along with SSE2 as a baseline; it is still the case for any code that uses newer extensions. Mechanisms such as IFUNCs are also used with GNU compilers to replace functions at runtime, using multiple implementations of a function in a single library and a resolver for the PLT on first call (an alternative to using hwcaps in the dynamic linker).
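
A sketch of the IFUNC mechanism mentioned above (a GNU/ELF extension; the names and the capability check here are placeholders): the resolver runs once, at the first call through the PLT, and picks an implementation.

#include <stdint.h>

static uint32_t bswap32_generic(uint32_t x) { return __builtin_bswap32(x); }
static uint32_t bswap32_fast(uint32_t x)    { return __builtin_bswap32(x); }

static uint32_t (*resolve_bswap32(void))(uint32_t)
{
    int have_fast_swap = 0;     /* a real resolver would probe hardware capabilities */
    return have_fast_swap ? bswap32_fast : bswap32_generic;
}

uint32_t app_bswap32(uint32_t) __attribute__((ifunc("resolve_bswap32")));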

BSWAP and other B extension intrinsics essentially need to be ubiquitous until distros compile for heterogeneous RV64GCB processors.

Of course it’s not really an issue for embedded devices where code is compiled from source by device vendors for homogeneous hardware.

It will be an issue for platforms like Android that allow native libraries via the NDK. On arm, one already has to create multiple versions of the native binary (embedded in the APK) for different ISA revisions, i.e. one can embed x86, x86_64, armeabi-v7a, arm64-v8a and previously mips (removed in NDK r17) in one application archive.

I suspect Android device makers would probably want B as a baseline extension (for binary compatibility) as Android devices perform quite a lot of network and crypto. BSWAP and ROR/ROL would probably be the most frequently called B intrinsics besides CLZ.

Michael Clark

unread,
Jun 10, 2018, 8:15:55 AM6/10/18
to Luke Kenneth Casson Leighton, Jacob Bachmeyer, Jim Wilson, Alex Elsayed, ron minnich, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
Single-endian architectures like x86 have MOVBE, a byte-swapping load/store instruction (unlike MOV, it requires one register operand and one memory operand). It was introduced on Haswell, which is very recent, and it is not on all x86_64 CPUs.

Bi-endian architectures tend to at least have a mode switch which applies to all load stores and then a byte swap instruction usually suffices.

PowerPC appears to have load and store byte reversed indexed (it doesn’t have a GREVI operand for bit and nibble reversals as those would be extremely rarely used).

All architectures tend to have a byte swap instructions. BSWAP was added to the Intel 486 in 1989 (~28 years ago).

For macro-op fusion, all that is necessary is for the second instruction to kill the temporary.

Load can be macro-op fused but store cannot because store doesn’t write to a register and thus can’t kill the temporary.

Load:

LW a0, 0(a1)
# a0 gets killed by the next insn
BSWAP a0, a0

Store:

BSWAP t0, a0
SW t0, 0(a1)
# t0 is live so we can’t fuse

The compiler attributes are orthogonal to the whole discussion as they are not part of the ISA nor the ABI.

Sure, load/store instructions could be added, but they present a larger binary-compatibility issue, as BSWAP can be used via __bswapsi2 and __bswapdi2.

As an aside, this reminds me of an issue I found with GCC intrinsics and libgcc on RV64, where di and ti versions are being created with no si version, leading to non-optimal codegen on RV64. I think this is with __builtin_clz and __builtin_ctz. When used on uint32_t on RV64, clz promotes to uint64_t and calls __clzdi2, which loops over the first 4 bytes of an int64_t, wasting cycles. It seems there is a missing expansion and libgcc version on 64-bit arches for a subset of the 32-bit versions of the builtins on platforms that use the generic versions. This doesn't affect most other platforms, as they have patterns to select the 32-bit instruction; RV64, on the other hand, is wasting cycles counting leading zeros for the first 4 bytes which we know are zero. I should raise a bug...
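
A minimal example of the pattern in question (illustrative):

#include <stdint.h>

/* On RV64, without a 32-bit expansion this can end up in the generic
   64-bit __clzdi2 libgcc routine, as described above. */
static int clz32(uint32_t x)
{
    return x ? __builtin_clz(x) : 32;   /* __builtin_clz(0) is undefined */
}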

Luke Kenneth Casson Leighton

unread,
Jun 10, 2018, 8:46:11 AM6/10/18
to Michael Clark, Jacob Bachmeyer, Jim Wilson, Alex Elsayed, ron minnich, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
On Sun, Jun 10, 2018 at 1:15 PM, Michael Clark <michae...@mac.com> wrote:

>> On 10/06/2018, at 5:42 PM, Luke Kenneth Casson Leighton <lk...@lkcl.net> wrote:
>>
>>> On Sun, Jun 10, 2018 at 5:54 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>>
>>> The resultant code would then be more like: (hand-written assembler)
>>>
>>> sub:
>>> LUI a5, %hi(bar)
>>> SW.BE t0, a0,%lo(bar)(a5)
>>> JR ra
>>>
>>> Unfortunately, the big-endian STORE needs a temporary that it can clobber,
>>> but this does not affect macro-op fusion because STORE otherwise does not
>>> write to the register file.
>>
>> oo. ah. right. can i possibly clarify, as this is quite important:
>> is it a known guaranteed-defined characteristic of macro-op fusion
>> that the register utilised as an intermediate does NOT modify the main
>> register file? if it is, it doesn't quite sound right, because a
>> system *not* having macro-op fusion would act differently from one
>> that did, and that would be bad.
>>
>> if it isn't then the instructions needed just got a little more
>> complicated (because a temporary register is needed). which is
>> probably why h/w support for BE instructions has traditionally been
>> added...
>
> Single endian architectures like x86 have MOVBE a load store instruction (unlike MOV it takes a register and a memory operand). It was introduced on Haswell which is very recent. It is not on all x86_64 CPUs.

yes i noticed that on the wikipedia page
https://en.wikipedia.org/wiki/Endianness#Bi-endianness - what took 'em
so long?? :)


> Bi-endian architectures tend to at least have a mode switch which applies to all load stores and then a byte swap instruction usually suffices.

this was what i was referring to about the conflict-resolution system:
the conflict-resolution concept is a generalised form of the "mode
switch" concept, which in turn is a variant of the MISA concept, where
the difference is that whilst MISA *only* swaps out instructions, C.R.
and M.S. concepts switch *IN* another instruction *in place* of the
one switched out. just a NOT gate on an extra wire into the demuxer
of another op... *really* not hard to do.

> PowerPC appears to have load and store byte reversed indexed (it doesn’t have a GREVI operand for bit and nibble reversals as those would be extremely rarely used).
>
> All architectures tend to have a byte swap instructions. BSWAP was added to the Intel 486 in 1989 (~28 years ago).
>
> For macro-op fusion, all that is necessary is for the second instruction to kill the temporary.
>
> Load can be macro-op fused but store cannot because store doesn’t write to a register and thus can’t kill the temporary.
>
> Load:
>
> LW a0, 0(a1)
> # a0 gets killed by the next insn
> BSWAP a0, a0
>
> Store:
>
> BSWAP t0, a0
> SW t0, 0(a1)
> # t0 is live so we can’t fuse

ok got it, that's really clear (and also fascinating). thanks michael.

so, that means that whilst LD would be fine to macro-op fuse, ST
would need to either reserve one temporary register from the available
registers or use a sequence which always pushes the temporary onto the
stack and then pops it back off afterwards. optimisations of that
basic crude strategy are where it gets messy.

all of which kinda points inevitably to having the BE.ST instruction
and if you're going to have that you might as well have the BE.LD one
as well.

l.

Jim Wilson

unread,
Jun 10, 2018, 11:17:21 AM6/10/18
to Michael Clark, Luke Kenneth Casson Leighton, Jacob Bachmeyer, Alex Elsayed, ron minnich, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
On Sun, Jun 10, 2018 at 5:15 AM, Michael Clark <michae...@mac.com> wrote:
> Single endian architectures like x86 have MOVBE a load store instruction (unlike MOV it takes a register and a memory operand). It was introduced on Haswell which is very recent. It is not on all x86_64 CPUs.

This was added to improve bi-endian icc performance. They never
needed it before, which is why they never added it before.

Jim

Jacob Bachmeyer

unread,
Jun 10, 2018, 8:34:05 PM6/10/18
to Michael Clark, Luke Kenneth Casson Leighton, Andrew Waterman, Samuel Falvo II, Allen Baum, RISC-V ISA Dev, Shumpei Kawasaki
Michael Clark wrote:
>> On 10/06/2018, at 12:55 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>> Luke Kenneth Casson Leighton wrote:
>>
>>> On Sat, Jun 9, 2018 at 10:45 AM, Luke Kenneth Casson Leighton
>>> <lk...@lkcl.net> wrote:
>>>
>>>> On Sat, Jun 9, 2018 at 10:30 AM, Andrew Waterman <and...@sifive.com> wrote:
>>>>
>>>>> My point is that there will be such a hit, no matter which approach is
>>>>> taken. There's effectively no room in RVC to encode new big-endian
>>>>> loads and stores. There's effectively no room in RVI to encode new
>>>>> big-endian loads and stores with 12-bit offsets. So, you're left
>>>>> either wider instructions or two-instruction sequences.
>>>>>
>>>> there is another option: the conflict-resolution scheme. it was
>>>> discussed a couple months back, and is "effectively" as if 32-bit (or
>>>> other sized) opcodes had been extended (by some hidden bits that are
>>>> set with a CSR).
>>>>
>>> p.s. jacob already came up with a corresponding / matching scheme for
>>> compilers / binutils, which takes the hidden prefix into account and
>>> walks it through from gcc to binutils to actual assembler.
>>>
>> The extensible assembler database I proposed assumes that each processor will have *one* and *only* one set of recognized instructions. (The "hidden prefix" is the immutable vendor/arch/impl tuple in my proposals.) In other words, it moves the conflict-resolution to the RISC-V ecosystem as a whole, allowing different implementations to have different (possibly overlapping) non-standard extensions, while avoiding "hidden state" in the instruction decoder. (Which simplifies analysis of RISC-V binaries. If (as we all want) RISC-V becomes popular, there *will* eventually be RISC-V malware that will need to be reverse-engineered. Multiple interpretations (as arranged for by the "conflict resolution" proposals) will only complicate this and possibly delay discovery of such malware. Consider a worm that targets some kind of IoT device with particular unusal extensions and uses those to avoid executing in honeypots that do not have them. Crashes when such a worm infests a non-targeted device are much more obvious then "oh, this processor does not support instruction set XYZ, so it is not the IoT device I seek; lay low and spread more".)
>>
>> If big-endian memory access is standardized as LOAD/GREV and GREV/STORE fusion pairs, represented with assembler pseudo-instructions, then the extensible assembler database I propose would permit those pseudo-instructions to be overridden for a specific implementation that might, for example, choose to use CUSTOM-0 and CUSTOM-1 as big-endian LOAD and STORE. The compiler simply generates "LW.BE" and the assembler either expands that to a LOAD/GREV pair or an implementation-specific encoding, which may be a non-standard 32-bit opcode.
>>
>
> And binary compatibility goes out the window.
>

Yes, that is the inherent cost of using non-standard instruction
encodings: programs that take advantage of the non-standard encoding
are not portable, period.


-- Jacob

Jacob Bachmeyer

unread,
Jun 10, 2018, 8:39:05 PM6/10/18
to Luke Kenneth Casson Leighton, Jim Wilson, Alex Elsayed, ron minnich, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
Luke Kenneth Casson Leighton wrote:
> On Sun, Jun 10, 2018 at 5:54 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>
>> The resultant code would then be more like: (hand-written assembler)
>>
>> sub:
>> LUI a5, %hi(bar)
>> SW.BE t0, a0,%lo(bar)(a5)
>> JR ra
>>
>> Unfortunately, the big-endian STORE needs a temporary that it can clobber,
>> but this does not affect macro-op fusion because STORE otherwise does not
>> write to the register file.
>>
>
> oo. ah. right. can i possibly clarify, as this is quite important:
> is it a known guaranteed-defined characteristic of macro-op fusion
> that the register utilised as an intermediate does NOT modify the main
> register file? if it is, it doesn't quite sound right, because a
> system *not* having macro-op fusion would act differently from one
> that did, and that would be bad.
>

Because ordinary STORE never writes to the register file, macro-op
fusion is still possible; the fused macro-op simply writes to both
memory and a register.

Another option is to make the big-endian STORE a three-instruction
sequence: "SW.BE rs, (addr)" -> "BSWAP.W rs, rs/SW rs, (addr)/BSWAP.W
rs, rs"


-- Jacob

Luke Kenneth Casson Leighton

unread,
Jun 10, 2018, 9:33:01 PM6/10/18
to Jacob Bachmeyer, Jim Wilson, Alex Elsayed, ron minnich, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
On Mon, Jun 11, 2018 at 1:39 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> Another option is to make the big-endian STORE a three-instruction sequence:
> "SW.BE rs, (addr)" -> "BSWAP.W rs, rs/SW rs, (addr)/BSWAP.W rs, rs"

so... that's like the 2-register in-place swap algorithm which uses
XOR. swap the register to be stored (in-place), store it, then swap
it *back* to the original value.

and michael described how LD macro-op works (because with LD you can
use the dest reg for the LD as a temporary register), so... it kinda
all works out in the end.

that's if 3-ops and 2-ops can be tolerated where one would be needed
if there was a be mode. so it comes down to performance analysis.

but... even before we get there, all of this places xBitManip at a
much higher priority than it has gotten at present... and the BM WG
was shut down unceremoniously when AMD wouldn't agree to the RISC-V
Foundation Patent terms...

l.

Jacob Bachmeyer

unread,
Jun 10, 2018, 10:45:34 PM6/10/18
to Luke Kenneth Casson Leighton, Jim Wilson, Alex Elsayed, ron minnich, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
Luke Kenneth Casson Leighton wrote:
> On Mon, Jun 11, 2018 at 1:39 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> Another option is to make the big-endian STORE a three-instruction sequence:
>> "SW.BE rs, (addr)" -> "BSWAP.W rs, rs/SW rs, (addr)/BSWAP.W rs, rs"
>>
> so... that's like the 2-register in-place swap algorithm which uses
> XOR. swap the register to be stored (in-place), store it, then swap
> it *back* to the original value.
>
> and michael described how LD macro-op works (because with LD you can
> use the dest reg for the LD as a temporary register), so... it kinda
> all works out in the end.
>
> that's if 3-ops and 2-ops can be tolerated where one would be needed
> if there was a be mode. so it comes down to performance analysis.
>

The only cost is code size -- both of those sequences are amenable to
macro-op fusion. Further, since they would be expanded from assembler
pseudo-ops, embedded systems that really need fast big-endian support
could supply non-standard 32-bit encodings (in the extensible assembler
database) that would override the pseudo-op expansion when assembling
non-portable code for those implementations.

> but... even before we get there, all of this places xBitManip at a
> much higher priority than it has gotten at present... and the BM WG
> was shut down unceremoniously when AMD wouldn't agree to the RISC-V
> Foundation Patent terms...

I think that you might "know someone" who could start a new BM WG. :-)
Or RVXBitManip could become a community-originated RVB proposal.


-- Jacob

Luke Kenneth Casson Leighton

unread,
Jun 10, 2018, 11:52:15 PM6/10/18
to Jacob Bachmeyer, Jim Wilson, Alex Elsayed, ron minnich, Madhu, Guy Lemieux, Allen Baum, RISC-V ISA Dev
On Mon, Jun 11, 2018 at 3:45 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> I think that you might "know someone" who could start a new BM WG. :-) Or
> RVXBitManip could become a community-originated RVB proposal.

:) there's a couple of options, i'll start sending out some enquiries.

Madhu

unread,
Jun 11, 2018, 12:47:11 AM6/11/18
to Jacob Bachmeyer, Luke Kenneth Casson Leighton, Jim Wilson, Alex Elsayed, ron minnich, Guy Lemieux, Allen Baum, RISC-V ISA Dev
The Shakti team is playing around with a few options. We figured we
will propose something formal once we get a bit of data from our
experiments. Lots of items are jockeying for our dev time, so this was
not on our list of priorities till now. But now that we have customers
asking for BE support, and complaints about code size, we have started
some work in this area. But the deliberations here have been closely
followed!
--
Regards,
Madhu

Allen Baum

unread,
Jun 11, 2018, 1:11:39 AM6/11/18
to Jim Wilson, Michael Clark, Luke Kenneth Casson Leighton, Jacob Bachmeyer, Alex Elsayed, ron minnich, Madhu, Guy Lemieux, RISC-V ISA Dev
This conversation is getting a bit fragmented and is sprawling a bit more than I had expected. Two points:
1. The object is to enable existing big-endian code written in a high-level language with no little-endian support to be compiled and executed on an RV correctly. Performance should be good, not perfect, as a lot of this older code will be running on older, slower architectures (an assumption on my part).

2. The Ldx, GRevI combination doesn't work for signed operands on an RV if X<xlen.
To support that you need LD, GRevI, SRA. Combining GRevI with a sign-extension option could work, but only by limiting the general case to the specific cases that matter (so, it would remove the "General" part of the mnemonic).

-Allen

Luke Kenneth Casson Leighton

unread,
Jun 11, 2018, 1:12:13 AM6/11/18
to Madhu, Jacob Bachmeyer, Jim Wilson, Alex Elsayed, ron minnich, Guy Lemieux, Allen Baum, RISC-V ISA Dev
On Mon, Jun 11, 2018 at 5:47 AM, Madhu <ma...@macaque.in> wrote:

> The Shakti team is playing around with a few options. We figured we
> will propose something formal once we
> get a bit of data from our experiments.

awesome! could you elaborate in which areas (gcc? spike?) so that
there's no duplication of effort if anyone else wishes to contribute
or work on this in parallel?

l.

Samuel Falvo II

unread,
Jun 11, 2018, 1:28:20 AM6/11/18
to Allen Baum, Jim Wilson, Michael Clark, Luke Kenneth Casson Leighton, Jacob Bachmeyer, Alex Elsayed, ron minnich, Madhu, Guy Lemieux, RISC-V ISA Dev
On Sun, Jun 10, 2018 at 10:11 PM, Allen Baum
<allen...@esperantotech.com> wrote:
> 2. Ldx, GRevI combination doesn't work for signed operands on an RV if X<xlen.
> To support that you need to LD, GrevI, SRA. Combining GrevI with a sign extension option could work- but only by limiting the general case to the specific cases that matter ( so, it would remove the "General" part of the mnemonic)

This is why I mentioned the need to have different byte swap
instructions for different word widths. Especially since we lack a
rotate instruction. With x86, XCHG xH,xL would work to byte-swap a
16-bit word, BSWAP for 32-bit word, and ROL or ROR for a 64-bit word.
They can get away with having only a single byte-swap instruction.
For us, not so much.

--
Samuel A. Falvo II

Jacob Lifshay

unread,
Jun 11, 2018, 5:18:01 AM6/11/18
to Samuel Falvo II, Allen Baum, Jim Wilson, Michael Clark, Luke Kenneth Casson Leighton, Jacob Bachmeyer, Alex Elsayed, ron minnich, Madhu, Guy Lemieux, RISC-V ISA Dev
x86_64 has both a 32 and 64-bit version of BSWAP: https://www.felixcloutier.com/x86/BSWAP.html
I'm not sure how you got a 64-bit rotate to byte-swap a 64-bit value, that doesn't make sense to me.

Jacob Lifshay

lk...@lkcl.net

unread,
Jun 11, 2018, 5:27:49 AM6/11/18
to RISC-V ISA Dev, ji...@sifive.com, michae...@mac.com, lk...@lkcl.net, jcb6...@gmail.com, etern...@gmail.com, rmin...@gmail.com, ma...@macaque.in, glem...@vectorblox.com


On Monday, June 11, 2018 at 6:11:39 AM UTC+1, Allen Baum wrote:
 
2. Ldx, GRevI combination doesn't work for signed operands on an RV if X<xlen.

 ohh... so... if you have to load a 16-bit word from memory, and it's to be signed, the sign-extension is done from the *first* byte (in memory) not the second, but a standard (little-endian) signed LD would take the *second* byte to do the sign-extension from.  bit 15 not bit 7.

 ...yuk!
 
To support that you need to LD, GrevI, SRA. Combining GrevI with a sign extension option could work- but only by limiting the general case to the specific cases that matter ( so, it would remove the "General" part of the mnemonic)


it doesn't quite feel right, to make such a specialist operation (GREVI.Signed).  as in: if there was any specialist operation to be made and added, LD.BE (and LD.BE.Signed) would be better candidates.

l.

Clifford Wolf

unread,
Jun 11, 2018, 1:14:59 PM6/11/18
to RISC-V ISA Dev
Hi,

I just realized I did send this reply to Allen only, but it might actually be of interest to anyone else participating in this mail thread as well.

PS: I've now added byte-swap+sign-extend instructions to XBitmanip.
See https://raw.githubusercontent.com/cliffordwolf/xbitmanip/master/xbitmanip-draft.pdf

---------- Forwarded message ---------
From: Clifford Wolf <cliffor...@gmail.com>
Date: Sat, Jun 9, 2018 at 2:48 PM
Subject: Re: [isa-dev] Bit Manipulation and Big Endian support
To: Allen Baum <allen...@esperantotech.com>


Hi,

On Fri, Jun 8, 2018 at 11:50 PM Allen Baum <allen...@esperantotech.com> wrote:
So 4 possible levels of support:
 - none
 - swap instructions
 - BigEndian load/store mode
 - BigEndian load/store instructions

I think dedicated BE load/store instructions with (A) a smaller immediate, or (B) in the 48-bit
encoding space, or (C) covering only some of the load/store instructions, are the right choice
here. I actually think (C) is best. Below is my argument why. (Skip to the end of this mail if
you just want to see my proposal.)

The issue I see with swap instructions is that they do not fix sign extension; at least that's the
case with the XBitmanip instructions. So if you perform say a 16-bit unsigned load from
memory, then bswap.h is sufficient to fix the endianness. But for a signed load one would need
to add an additional slli+srai pair for the correct sign extension after endianness conversion. (For
32 bit values on RV64 a single addiw instruction can be used for performing sign extension.)
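
Expressed in C (using the GCC/Clang __builtin_bswap* builtins purely for illustration; the instruction sequences in the comments restate the cases above, they are not measurements):

    #include <stdint.h>

    /* unsigned 16-bit BE load: the swap alone is enough
       (roughly: lhu; bswap.h)                            */
    static inline uint16_t load_be16u(const uint16_t *p)
    {
        return __builtin_bswap16(*p);
    }

    /* signed 16-bit BE load: after the swap the value is not yet
       sign-extended, so the extra slli+srai step remains
       (roughly: lhu; bswap.h; slli; srai)                         */
    static inline int16_t load_be16s(const int16_t *p)
    {
        return (int16_t)__builtin_bswap16((uint16_t)*p);
    }

    /* signed 32-bit BE load on RV64: a single sign-extending
       instruction (e.g. addiw) after the swap is enough      */
    static inline int32_t load_be32s(const int32_t *p)
    {
        return (int32_t)__builtin_bswap32((uint32_t)*p);
    }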

A BigEndian mode sounds messy to work with tbh. But it would solve the issue of encoding space
for BE load/store because only a CSR would be needed, not any additional instructions. (On the
other hand many architectures have successfully used modes to switch between LE and BE, such
as ARM, MIPS, PowerPC, and Alpha, if my memory serves me right.)

To collect some data I've disassembled all binaries in the debian-riscv64-tarball-20180418.tar.gz
base image and extracted all load/store instructions to do some static instruction frequency
analysis. (This is a dataset of 1M load instructions and 0.5M store instructions.)


38% of all loads are relative to sp and 64% of all stores are relative to sp. Arguably most
of them are register spills to the stack for which no pointer is generated, thus the C compiler
would be free to store them in little endian without LE/BE conversion. Similarly LE/BE
conversion could be omitted for compiler-generated data (such as jump tables) by already
storing this information in little-endian. (This kind of optimization may be something that's
performed for -Os and -O3 only.)

Including instructions from RV128, the LOAD opcode is completely full already:

LOAD OPCODE: lb, lbu, lh, lhu, lw, lwu, ld, ldu

There are 5 minor opcodes used in STORE (thus 3 are free):

STORE OPCODE: sb, sh, sw, sd, sq

And there are 3 minor opcodes used in MISC-MEM (thus 5 are free):

MISC-MEM OPCODE: fence, fence.i, lq

The following 11 instructions would be needed to fully support big endian load/store
in RV32, RV64, and RV128:

BE LOADS:  lh.be, lhu.be, lw.be, lwu.be, ld.be, ldu.be, lq.be
BE STORES: sh.be, sw.be, sd.be, sq.be

However, 98% of the load/store instructions in my dataset (97% if we do not include
load/store relative to sp) are either LD/LW/SD/SW instructions or are LB/LBU/SB
instructions. No big-endian versions of LB/LBU/SB are needed and there is sufficient
space in STORE for sd.be + sw.be and there is sufficient space in MISC-MEM for ld.be + lw.be.

------------------------------------------------------------------------------------

So therefore my suggestion would be the following:

1. Add big-endian versions of sd and sw using two free minor opcodes in STORE.

2. Add big-endian versions of ld and lw using two free minor opcodes in MISC-MEM.

3. Add support for bswap.h, bswap.w, bswap, (and bswap.d for RV128) from XBitmanip
for performing endian conversion of unsigned values. Those opcodes would then be
shared between the two ISA extensions.

4. Add additional instructions econv.h, econv.w, (and econv.d for RV128) for endian
conversion of signed values. (This could use some of the reserved instruction encoding
space within the XBitmanip family of "generalized zip" instructions.)

This would provide big-endian capable simple 32-bit encodings for over 95% of all
load/store instructions, and would allow the remaining cases to be implemented using
relatively efficient two-instruction sequences.

This is less orthogonal than adding support for BE versions of all load/store ops using 48-bit
encodings, but I think it would yield significantly more efficient code than using 48-bit
load/store ops.

------------------------------------------------------------------------------------

However, if four I-type/S-type instructions are deemed too costly in terms of encoding space,
then I think a CSR flag for switching between LE and BE mode would probably be the
best remaining option.

(In that case an additional extension providing 48-bit instructions for the other endianness
might in fact be useful for applications that have to process a lot of data in both little and big
endian format.)

regards,
 - clifford

PS: Obligatory XBitmanip link:

PPS: Another solution would be to provide BE versions of load/store ops with a smaller
immediate. 10 bits instead of 12 bits would still work for 90% of the cases according
to my data. But that would make a mess of things like the handling of load/store global
pseudo-instructions.

Luke Kenneth Casson Leighton

unread,
Jun 11, 2018, 9:24:53 PM6/11/18
to Clifford Wolf, RISC-V ISA Dev
On Mon, Jun 11, 2018 at 6:14 PM, Clifford Wolf <cliffor...@gmail.com> wrote:

> Hi,
>
> I just realized I did send this reply to Allen only, but it might actually
> be of interest to anyone else participating in this mail thread as well..

yes definitely

> PS: I've now added byte-swap+sign-extend instructions to XBitmanip.
> See
> https://raw.githubusercontent.com/cliffordwolf/xbitmanip/master/xbitmanip-draft.pdf

great! that will help out the Shakti Team to do some evaluation of it.

> A BigEndian mode sounds messy to work with tbh. But it would solve the issue
> of encoding space
> for BE load/store because only a CSR would be needed, not any additional
> instructions. (On the
> other hand many architectures have successfully used modes to switch between
> LE and BE, such
> as ARM, MIPS, PowerPC, and Alpha, if my memory serves me right.)

a mode-switch's wires would come into the actual instruction(s) to
change their behaviour; the conflict-resolution-mode's wires would
come into the instruction demux selector(s) to change *their*
behaviour.... topologically, netlist-wise and boolean-logic-wise the
difference is zero-to-negligible.


> To collect some data I've disassembled all binaries in the
> debian-riscv64-tarball-20180418.tar.gz
> base image and extracted all load/store instructions to do some static
> instruction frequency
> analysis. (This is a dataset of 1M load instructions and 0.5M store
> instructions.)
>
> https://nbviewer.jupyter.org/url/svn.clifford.at/handicraft/2018/rvinsfreq/ldst.ipynb
> http://svn.clifford.at/handicraft/2018/rvinsfreq/ldst.sh

superb! this is excellent clifford and extremely useful.

i thought about it overnight and i have some questions, based on the following:

* we know from what Madhu of IIT Madras mentioned, that they have
been asked to design a processor that will replace very old industrial
systems. from talking to them i know that these are VME-based, so the
I/O cards that plug into these systems will remain.

* we also know that japan has a particular requirement, we know that
there is someone (whom i know talked to mafm and yunsup) who maintains
the unofficial debian powerpc-be port and it gets *massive* amounts of
traffic... but we do not know at this point exactly what the use-cases
are.

so the questions are:

(1) this is a static analysis. how do we know that a static analysis
(even though certain LD/ST width-types are most commonly generated)
will give the right *dynamic* usage statistics? certain I/O loops for
example would increase the use-count significantly, and we understand
from the industrial systems it's the I/O on these bi-endian systems
that does the most work.

(2) the linux kernel - where much of the I/O would be doing much of
its work - would that be doing mostly 16-bit I/O or would it be doing
mostly 32-bit I/O, or 64-bit?
(2a) on these legacy systems with 68020s how likely is it that
they're running a full OS (let alone a linux-based kernel)?

(3) would it be better to use statistical instrumentation using
either qemu or spike to get some dynamic statistics (if indeed that's
practical)?

continued...

>
> ------------------------------------------------------------------------------------
>
> So therefore my suggestion would be the following:
>
> 1. Add big-endian versions of sd and sw using two free minor opcodes in
> STORE.
>
> 2. Add big-endian versions of ld and lw using two free minor opcodes in
> MISC-MEM.

... the analysis technique you've developed is superb and also
highlights that there *is* actually potential (incomplete) space for
relevant minor opcodes, which is good news. that in itself is
extremely valuable information.

would it be best to wait until someone has done some dynamic analysis
of industrial and other use-cases before proceeding?


> ------------------------------------------------------------------------------------
>
> However, if four I-type/S-type instructions is deemed too costly in terms of
> encoding space,
> then I think a CSR flag for switching between LE and BE mode would probably
> be the
> best remaining option.
>
> (In that case an additional extension providing 48-bit instructions for the
> other edianness
> might in fact be useful for applications that have to process a lot of data
> in both little and big
> endian format.)

that's an extremely useful insight, clifford. one of the issues with
mode-swapping / conflict-resolution-swapping is, you can't (shouldn't
/ mustn't) use it to cross function-call (ABI) boundaries (unless of
course the ABI is itself B.E in which case you *must* mode-swap to
B.E. before calling). so it means that any given function would have
to constantly swap in and out of BE mode.

l.

Tommy Thorn

unread,
Jun 12, 2018, 1:51:51 PM6/12/18
to Clifford Wolf, RISC-V ISA Dev
Not correct.  For the sign-extended sub-word case you only need two instructions.  E.g. 32-bit:

  BSWAP rd, rs1
  SRAI rd, rd, 32

Similarly, you don't need a BSWAP.W, as you can do it with BSWAP + SRL
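
In C the same trick looks like this (GCC/Clang builtins, RV64 assumed, and relying on the usual arithmetic behaviour of >> on signed types; just a sketch of the idea, not proposed spec text):

    #include <stdint.h>

    /* the full-register swap moves the interesting low 32 bits to the top;
       one arithmetic shift brings them back down *and* sign-extends       */
    static inline int32_t bswap32_signed(uint64_t x)
    {
        return (int32_t)((int64_t)__builtin_bswap64(x) >> 32);
    }

    /* same shape with a logical shift for the unsigned case */
    static inline uint32_t bswap32_unsigned(uint64_t x)
    {
        return (uint32_t)(__builtin_bswap64(x) >> 32);
    }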

Unpopular opinion: I don't think RISC-V has to satisfy everyone.  The founders used to brag about how
small the spec was ("Shorter than the Arm table of contents").  It's no longer small and there's a distinct
featuritis / kitchen-sink creep.

Even adding a single instruction like BSWAP costs something, and certainly in an FPGA softcore, having
it will affect the critical path and thus penalize _everything_ else.  As I repeat nearly every time set B is
discussed: PLEASE QUANTIFY!  I challenge everyone who cares about this to SHOW measurements from REAL
code that the feature they propose will reduce the instruction count by more than, say, 3%.

Tommy



Clifford Wolf

unread,
Jun 12, 2018, 2:10:42 PM6/12/18
to Tommy Thorn, RISC-V ISA Dev
Hi,


On Tue, Jun 12, 2018, 19:51 Tommy Thorn <tommy...@esperantotech.com> wrote:
Not correct.  For the sign-extended sub-word case you only need two instructions.  E.g. 32-bit:

  BSWAP rd, rs1
  SRAI rd, rd, 32

Yes, but still two instructions, which is a lot if you have to add it to every load and store.

Even adding a single instruction like BSWAP costs something, and certainly in an FPGA softcore, having
it will affect the critical path and thus penalize _everything_ else. As

This is why this is not part of the base spec but proposed as part of an extension.

It's up to the implementer if they want to support it. Arguing that the B extension should not contain anything because everything costs something is idiotic, since someone who doesn't want to invest anything at all would simply opt not to implement the extension.

I repeat nearly every time set B is
discussed: PLEASE QUANTIFY!  I challenge everyone who cares about this to SHOW measurements from REAL
code that the feature they propose will reduce the instruction count by more than, say, 3%.

The way you say that implies that you think this is not the plan. But if you had read the xbitmanip document you would know that this is exactly the next planned step.

It even contains multiple remarks pointing out that some features are included "for now" simply because it is easier to collect data on features that are included in the draft spec.

Reading before making all caps demands sometimes really pays off. You should try it.

Regards,
 Clifford

Shumpei Kawasaki

unread,
Jun 12, 2018, 7:22:44 PM6/12/18
to Tommy Thorn, Allen Baum, RISC-V ISA Dev

Most architectures start with one endian and end up adopting some form of bi-endianness some years later. W.r.t. endian choice everything works just fine so long as the system exists as an isolated island in the world. Being left-handed or right-handed does not matter until you start playing a sport. I believe the bi-endian discussion is a sign that RISC-V is now going through a real-world adoption phase where $ is at stake.



Shumpei Kawasaki

unread,
Jun 12, 2018, 7:32:20 PM6/12/18
to Luke Kenneth Casson Leighton, Jacob Bachmeyer, Allen Baum, RISC-V ISA Dev

Bi-endianness is proposed as an optional architectural feature. RISC-V as a whole will not become a "second rate citizen" if we choose to put this matter that way. 

GCC, Linux and UEFI inherently support bi-endianness, and the effort which went into those projects to make them bi-endian was serious effort by serious engineers.

You are correct that at this point it's almost too late. That is why we are planning to get special funding in 2018 to escalate this effort. 

2018-06-09 11:01 GMT+09:00 Luke Kenneth Casson Leighton <lk...@lkcl.net>:
On Sat, Jun 9, 2018 at 2:39 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> There is no room in the 32-bit opcode space to put big-endian LOAD/STORE,
> but an extension could easily add these as 48-bit or 64-bit opcodes.

 the consequences of that are that it would make big-endian a "second
rate citizen"... although at this point it's almost too late.
https://en.wikipedia.org/wiki/Endianness#Bi-endianness seems to me to
imply that the equivalent of CSRs may have been used historically to
set endian-ness.

l.

Shumpei Kawasaki

unread,
Jun 12, 2018, 7:35:05 PM6/12/18
to Jacob Bachmeyer, Luke Kenneth Casson Leighton, Allen Baum, RISC-V ISA Dev

>RISC-V program text is little-endian, and changing *that* would make a huge mess. 

A very critical point.  

2018-06-09 11:13 GMT+09:00 Jacob Bachmeyer <jcb6...@gmail.com>:
Luke Kenneth Casson Leighton wrote:
On Sat, Jun 9, 2018 at 2:39 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
 
There is no room in the 32-bit opcode space to put big-endian LOAD/STORE,
but an extension could easily add these as 48-bit or 64-bit opcodes.
   

 the consequences of that are that it would make big-endian a "second
rate citizen"... although at this point it's almost too late.
https://en.wikipedia.org/wiki/Endianness#Bi-endianness seems to me to
imply that the equivalent of CSRs may have been used historically to
set endian-ness.

I do not see a serious problem here, since that proverbial ship has arguably already sailed:  RISC-V program text is little-endian, and changing *that* would make a huge mess.  Further, the use of additional big-endian memory access opcodes would make RISC-V truly bi-endian, with the big-endian/little-endian distinction being made at runtime and encoded into the program text, rather than being an implicit parameter.  I argue that this is a better fit, since it would allow/require the expected byte order for data to be explicitly stated in the program.

Lastly, (and this ties back to the extensible assembler database I proposed earlier) standardizing big-endian memory access as 48-bit or 64-bit opcodes does not preclude implementations from "aliasing" those long-form standard opcodes into the 32-bit opcode space as non-standard encodings of standard instructions.


-- Jacob

Luke Kenneth Casson Leighton

unread,
Jun 12, 2018, 9:05:32 PM6/12/18
to Shumpei Kawasaki, Jacob Bachmeyer, Allen Baum, RISC-V ISA Dev
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68


On Wed, Jun 13, 2018 at 12:32 AM, Shumpei Kawasaki
<shumpei....@swhwc.com> wrote:
>
> Bi-endianness is proposed as an optional architectural feature. RISC-V as a
> whole will not become a "second rate citizen" if we choose to put this
> matter that way.

ah that's not quite what i said: i said that i was concerned that
*bi-endian*-support-in-RISC-V would be made a 2nd-rate-citizen [when
compared to little-endian-support-in-RISC-V], not RISC-V-as-a-whole
would be made a 2nd-rate-citizen.

by that i meant that i was concerned that the bi-endian support would
require say 2-5 instructions where little-endian requires only one.
and that, as a result, in I/O performance-critical loops (particularly
say on low-power or low-speed industrial controllers that do I/O at a
bit-banging / byte-banging level), they might not even be able to keep
up with the required bus speed.

also although i do not know exact details, with the various
multi-instruction proposals i would be concerned that those
instructions are not atomic (i.e. could not be macro-op-fused in some
architectures), thus in some cases (linux kernel) requiring spinlocks
to be placed around them. now it is not just 2-5 instructions
per LD/ST, it's a lot more.

does that help clarify, shumpei?

> GCC, Linux and UEFI inherently supports bi-endianness and the effort which
> went into those projects to make them bi-endian was serious effort by
> serious engineers.
>
> You are correct that at this point it's almost too late. That is why we are
> planning to get special funding in 2018 to escalate this effort.

awesome.

Cesar Eduardo Barros

unread,
Jun 12, 2018, 9:41:56 PM6/12/18
to Allen Baum, Samuel Falvo II, Tommy Thorn, RISC-V ISA Dev
Em 08-06-2018 18:50, Allen Baum escreveu:
> So 4 possible levels of support:
>  - none
>  - swap instructions
>  - BigEndian load/store mode
>  - BigEndian load/store instructions

I'm a bit late to this discussion, so instead of replying to several
subthreads, I'll try to summarize here several possibilities, many I've
seen in this discussion and a few of my own. I apologize in advance for
the length of this email message.

I don't currently have an opinion on which is the best one, so I'll try
to list the pros and cons of each one (help me find more). I'll also
give each possibility a number to simplify further discussion.

These options are not all exclusive, for instance we could have both 3a
and 4b (in this case, 3a does the opposite of whichever the current mode
is).

These options are all about data access. RISC-V code is always in
little-endian order. An alternative would be a "middle endian" order in
which each 16-bit parcel is in big endian order, but to keep our sanity
let's pretend that this possibility doesn't exist and that I didn't
mention it.


Option 0a: "none"

That is, keep using shifts for byte order swaps.

- Pros: simplest option; no work needed; no new instructions; easy to
understand

- Cons: byte order swap is costly, and cannot be easily macro-op fused;
byte order swaps are slow and need an extra temporary register; code
size bloat


Option 0b: "always big-endian"

This means a RISC-V core hard-wired to always read and write data in big
endian order.

- Pros: simple to implement; no new instructions; easy to understand

- Cons: incompatible with the whole RISC-V software ecosystem; can't
really be called RISC-V; splits the software ecosystem in half; we've
been there with other bi-endian ISAs and it sucks


Option 0c: "memory endian"

The endianness is an attribute of the memory region. Loads and stores on
a big-endian region are big-endian; loads and stores on a little-endian
region are little-endian.

- Pros: no new instructions; a natural fit when the problem is talking
to big-endian I/O devices; mostly transparent to software

- Cons: doesn't easily allow for mixed endianness, for instance in a
network stack; hardcodes the endianness of a region of memory;
endianness is implicit, which can be confusing; byte access order varies
depending on the value of the register with the address, which can be
confusing


Option 0d: "page table endian"

Same as option 0c, but the endianness is an attribute of the data page,
controlled by the page table.

- Pros: no new instructions; more granular than option 0c; can be
changed at runtime

- Cons: uses one bit on every page table entry; granularity is still too
large; endianness is implicit, which can be confusing; byte access order
varies depending on the value of the register with the address and on
the page tables, which can be really confusing


Option 1a: fused "BSWAP+SRA"

A single new "BSWAP rd, rs" instruction, which reverses the byte order
of a whole register, optionally followed by (and macro-op fused with) a
"SRA rd, rd, imm".

- Pros: a single new instruction; minimum opcode space use; simple to
implement; easy to understand; we'll probably want a bswap without
memory accesses anyway

- Cons: can't be macro-op fused with a preceding load or following
store, unless the implementation can macro-op fuse a sequence of three
instructions; larger instruction count than the alternatives (except for
option 0a)


Option 1b: sign-extending BSWAP ("BSWAPS")

A set of sign-extending BSWAP instructions, one for each size larger
than a byte.

- Pros: does the swap in a single instruction even for less than XLEN;
can be macro-op fused with memory accesses (see options 2a and 2b
below); low opcode space use; we'll probably want a bswap without memory
accesses anyway

- Cons: needs 2 new instructions for RV32, 3 new instructions for RV64,
and 4 new instructions for RV128


Option 2a: fused "BSWAPS" with load/store

For a big-endian load, do a normal load followed by a sign-extending
BSWAP of the destination register. For a big-endian store, do a
sign-extending BSWAP of the source register followed by a normal store.

- Pros: low opcode space use; good performance when fused; implementing
big-endian memory access is optional (just don't fuse the sign-extending
BSWAP); even when fused, the store is still within the 2R1W register
file port limit

- Cons: needs 2 new instructions for RV32, 3 new instructions for RV64,
and 4 new instructions for RV128; when fused, uses the ALU twice (for
the BSWAP and the memory address computation)


Option 2b: fused "BSWAPS" with restricted load/store

Same as option 2a, but fused only when the immediate offset of the load
or store is zero.

- Pros: low opcode space use; good performance when fused; implementing
big-endian memory access is optional (just don't fuse the sign-extending
BSWAP); even when fused, the store is still within the 2R1W register
file port limit; when fused, uses the ALU only once (for the BSWAP)

- Cons: needs 2 new instructions for RV32, 3 new instructions for RV64,
and 4 new instructions for RV128; needs a separate ADDI instruction (and
possibly a temporary register) when the offset isn't zero


Option 3a: big-endian load and store

New instructions for big-endian loads and stores.

- Pros: good performance; does not need macro-op fusion; simple to
implement; easy to understand; big-endian is not a second-class citizen

- Cons: high opcode space use (stores can use the high bit of funct3,
but loads use it for unsigned, so have to be in another major opcode);
we'll probably want a bswap without memory accesses anyway


Option 3b: restricted big-endian load and store

Same as option 3a, but the new instructions do not have an immediate
field (offset is always zero).

- Pros: acceptable performance; does not need macro-op fusion; low
opcode space use

- Cons: needs 2 new instructions for RV32, 3 new instructions for RV64,
and 4 new instructions for RV128; depending on where in the opcode map
these new instructions are placed, might complicate decoding; needs a
separate ADDI instruction (and possibly a temporary register) when the
offset isn't zero; we'll probably want a bswap without memory accesses
anyway


Option 4a: local mode switch ("endian flag")

A bit in a flags register, which can be freely set by code on any
privilege level, which when set turns all loads and stores into
big-endian loads and stores. For those familiar with the x86 ISA, this
is similar to the "direction flag" (cld/std).

- Pros: no new instructions (it's a CSR)

- Cons: RISC-V doesn't have a non-FPU flags register; complicates
out-of-order implementations; is new state which has to be saved and
restored on every mode switch; has to be cleared on entry to any trap;
has to be saved before being cleared on entry to any trap, with an
unknown endianness for the store instruction; has to be cleared on Unix
signal delivery (see CVE-2008-1367)


Option 4b: global mode switch ("endian process")

A set of bits in a privileged CSR, one for each privilege mode, which
when set turns all loads and stores from that privilege mode into
big-endian loads and stores. Intended to be set once for each process.

- Pros: no new instructions (it's a CSR); easy to understand; avoids
many of the problems of option 4a; the same privilege level is
responsible for both changing the mode and saving and restoring it

- Cons: doesn't easily allow for mixed endianness, for instance in a
network process; needs two copies of shared library code, one for each
endianness; supervisor software might need to temporarily switch
endianness when accessing user memory (similar to the SUM bit)


Option 4c: per-page mode switch

A bit on the page table entries, which when set makes code executing
from that page treat all loads and stores as big-endian.

- Pros: no new instructions; can mix little-endian and big-endian code
in the same process

- Cons: uses one bit on every page table entry; probably complicates the
instruction TLBs, fetch and decode; needs something in the object file
format to specify which code pages have which endianness; instructions
can straddle a page boundary; byte access order varies depending on the
page tables, which can be confusing; I feel dirty for suggesting this one

--
Cesar Eduardo Barros
ces...@cesarb.eti.br

Jacob Bachmeyer

unread,
Jun 12, 2018, 10:54:30 PM6/12/18
to Cesar Eduardo Barros, Allen Baum, Samuel Falvo II, Tommy Thorn, RISC-V ISA Dev
Cesar Eduardo Barros wrote:
> [...]
> These options are all about data access. RISC-V code is always in
> little-endian order. An alternative would be a "middle endian" order
> in which each 16-bit parcel is in big endian order, but to keep our
> sanity let's pretend that this possibility doesn't exist and that I
> didn't mention it.

I agree with *not* introducing reverse-PDP-endian program code.

> [...]
> Option 2a: fused "BSWAPS" with load/store
>
> For a big-endian load, do a normal load followed by a sign-extending
> BSWAP of the destination register. For a big-endian store, do a
> sign-extending BSWAP of the source register followed by a normal store.
>
> - Pros: low opcode space use; good performance when fused;
> implementing big-endian memory access is optional (just don't fuse the
> sign-extending BSWAP); even when fused, the store is still within the
> 2R1W register file port limit
>
> - Cons: needs 2 new instructions for RV32, 3 new instructions for
> RV64, and 4 new instructions for RV128; when fused, uses the ALU twice
> (for the BSWAP and the memory address computation)

This is essentially my proposal, with the additional step of defining an
assembler pseudo-instruction that produces the fusion group in standard
code, and that can be overridden to use an implementation-specific
non-standard encoding for non-portable code on implementations that
choose to define 32-bit big-endian memory access encodings.

This could also be implemented using GREVI and an additional sign-extend
instruction for LOAD. (STORE does not care about sign-extension as
those bits are not written to memory.) So now we would have
3-instruction sequences for both LOAD (LOAD/GREVI/SEXT.W) and STORE
(GREVI/STORE/GREVI).
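
For reference, the store side written out in C (illustrative only; the second GREVI in the sequence above is presumably there so that the source register still holds its original value afterwards, something C hides by using a temporary):

    #include <stdint.h>

    /* big-endian store of a 32-bit value: swap into a temporary, store it */
    static inline void store_be32(uint32_t *p, uint32_t v)
    {
        *p = __builtin_bswap32(v);   /* swap (GREVI-equivalent), then STORE */
    }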

> [...]
> Option 4a: local mode switch ("endian flag")
>
> A bit in a flags register, which can be freely set by code on any
> privilege level, which when set turns all loads and stores into
> big-endian loads and stores. For those familiar with the x86 ISA, this
> is similar to the "direction flag" (cld/std).
>
> - Pros: no new instructions (it's a CSR)
>
> - Cons: RISC-V doesn't have a non-FPU flags register; complicates
> out-of-order implementations; is new state which has to be saved and
> restored on every mode switch; has to be cleared on entry to any trap;
> has to be saved before being cleared on entry to any trap, with an
> unknown endianness for the store instruction; has to be cleared on
> Unix signal delivery (see CVE-2008-1367)

RISC-V *does* have such a register: mstatus and its restricted views in
lower privilege modes. (There is a ustatus CSR, currently empty unless
RVN is implemented.) The catch of course is that mstatus has very few
bits remaining on RV32.

Unknown endianness should not be too much of a problem on trap entry,
since the endianness will need to be restored before the trap handler
returns. The real "fun" comes when supervisors want to look at the
saved values and need to possibly byte-swap or switch endianness before
reading the saved context. The actual return is much easier: restore
sstatus at some defined point, analogous to the point where endianness
was set on trap entry. The remaining saved context is in the endianness
of the user process, since it was saved before endianness was set.

This closely aligns with option 4b and is essentially 4b with each
privilege level able to manipulate its own flag, with some small
complexities related to trap handling due to using a single shared flag.


> Option 4b: global mode switch ("endian process")
>
> A set of bits in a privileged CSR, one for each privilege mode, which
> when set turns all loads and stores from that privilege mode into
> big-endian loads and stores. Intended to be set once for each process.
>
> - Pros: no new instructions (it's a CSR); easy to understand; avoids
> many of the problems of option 4a; the same privilege level is
> responsible for both changing the mode and saving and restoring it
>
> - Cons: doesn't easily allow for mixed endianness, for instance in a
> network process; needs two copies of shared library code, one for each
> endianness; supervisor software might need to temporarily switch
> endianness when accessing user memory (similar to the SUM bit)

If one of those bits is the U-mode endianness flag, and U-endian can be
set in U-mode, then this is very similar to option 4a, with one flag per
privilege level instead of a single shared flag.


-- Jacob

Luke Kenneth Casson Leighton

unread,
Jun 12, 2018, 11:05:51 PM6/12/18
to Cesar Eduardo Barros, Allen Baum, Samuel Falvo II, Tommy Thorn, RISC-V ISA Dev
non-atomic.

> Option 1a: fused "BSWAP+SRA"
>
> - Cons: can't be macro-op fused with a preceding load or following store,
> unless the implementation can macro-op fuse a sequence of three
> instructions; larger instruction count than the alternatives (except for
> option 0a)

can't be macro-op fused *at all* if the architecture does not support
macro-op fusion.... therefore is non-atomic.

> Option 4a: local mode switch ("endian flag")
> Option 4b: global mode switch ("endian process")

> - Cons: RISC-V doesn't have a non-FPU flags register; complicates
> out-of-order implementations; is new state which has to be saved and
> restored on every mode switch; has to be cleared on entry to any trap; has
> to be saved before being cleared on entry to any trap, with an unknown
> endianness for the store instruction; has to be cleared on Unix signal
> delivery (see CVE-2008-1367)

jacob and i went through this in some detail and came up with a
solution. these option(s) are functionally / topologically *directly*
equivalent to the isa-conflict-resolution scheme.

in other words we already worked out that you have *separate*
mode-switch CSRs for U and M mode, where the hardware *automatically*
flips between them, such that the problem of "clearing on trap" melts
away.

the signal-delivery issue i didn't know about, but it makes a lot of sense.

l.

Allen Baum

unread,
Jun 13, 2018, 12:45:29 AM6/13/18
to Cesar Eduardo Barros, Samuel Falvo II, Tommy Thorn, RISC-V ISA Dev
I don’t think this uses the ALU twice.
The shifter and adder are really separate functional units, and in many microarchitectures the address generator could be a completely separate functional unit from the main adder.

And even more to the point: the fused BSWAP unit is really on the load-aligner path, a different pipe stage from the main execution unit, so it should be separate from the main shifter (much less the main adder).

-Allen

> On Jun 12, 2018, at 6:41 PM, Cesar Eduardo Barros <ces...@cesarb.eti.br> wrote:
>
> Option 2a: fused "BSWAPS" with load/store
>
> For a big-endian load, do a normal load followed by a sign-extending BSWAP of the destination register. For a big-endian store, do a sign-extending BSWAP of the source register followed by a normal store.
>
> ...

Luke Kenneth Casson Leighton

unread,
Jun 13, 2018, 1:25:39 AM6/13/18
to Allen Baum, Cesar Eduardo Barros, Samuel Falvo II, Tommy Thorn, RISC-V ISA Dev
On Wed, Jun 13, 2018 at 5:45 AM, Allen Baum
<allen...@esperantotech.com> wrote:

> I don’t think this uses the ALU twice.
> The shifter and adder are really separate functional units, and in many micro architectures the address generator could be a completely separate functional unit than the main adder.
>
> And even more to the point: the fused BSWAP unit is really on the load aligner path, a different pipestage from the main execution unit, so should be separate from the main shifter (much less main adder).
>

oh, that reminds me. in SV i was looking at how things like RGB565
video/image processing could be done with 16-bit BEXT, 32-bit SIMD
then 32-to-16-bit BDEP, and how that's three instructions when it
really feels like it should be one.
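
for anyone who hasn't stared at RGB565 before, the bit layout in plain shift-and-mask C (this is just the standard 5:6:5 packing, not SV or xBitmanip code):

    #include <stdint.h>

    /* unpack one RGB565 pixel (the gather that a 16-bit BEXT would do) */
    static inline void rgb565_unpack(uint16_t px,
                                     uint8_t *r, uint8_t *g, uint8_t *b)
    {
        *r = (px >> 11) & 0x1f;   /* 5 bits */
        *g = (px >>  5) & 0x3f;   /* 6 bits */
        *b =  px        & 0x1f;   /* 5 bits */
    }

    /* pack it back again (the scatter direction, i.e. what BDEP covers) */
    static inline uint16_t rgb565_pack(uint8_t r, uint8_t g, uint8_t b)
    {
        return (uint16_t)(((r & 0x1f) << 11) | ((g & 0x3f) << 5) | (b & 0x1f));
    }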

ways to achieve that would be to have optional pre *and* post pipeline
stages wrapped around the ALU(s) that perform bit-manipulation prior
and post ops.

the thought occurred to me that it would be kinda nice to have that as
a generic, general-purpose "thing", done as CSRs/modes (obviously),
and LD/ST could hypothetically be treated as being no different.

non-optimising implementations would, instead of actually having
full/partial duplications of xBitManip in pre *and* post phases of the
pipeline, simply insert the appropriate *actual* opcodes into the
pipeline instead (or, if it has one, into the instruction FIFO).

this would have the distinct advantage of reducing the instruction
count (by "hiding" the implied BSWAP/GREVI etc.)

also, before anyone says "that's hugely esoteric, obscure and
therefore a pain to implement", well.. actually, it's kinda pretty
much exactly what a hardware implementation of LD.BE / ST.BE (or a
mode-switched variant of LD / ST) would have to do *anyway*. except
maybe not both pre- and post- at the same time.

oh. another thought. there's a refinement of the LD.BE / ST.BE
mode-switch concept (that comes from SV): a *per-register* CSR table
that marks *specific* registers as "big-endian swapped if used on a
LD/ST". this would definitely *not* be a mode-switch concept as it
involves marking individual registers and that requires a look-up
table that has to be referenced on every single LD/ST operation "is
this register being used, if so is it marked in the LD/ST table as
"BE" if so make this a BE-version of LD/ST".

all of which seems like an awful lot of effort to go to.... except
that it has other use-cases if generalised.

so anyway, just for completeness and/or consideration, whilst we're
still at the brainstorming phase. damn i should have kept a wiki page
with all this stuff in, too late now: 60 messages and climbing fast...
:)

l.

Iztok Jeras

unread,
Jun 13, 2018, 7:45:34 AM6/13/18
to RISC-V ISA Dev, allen...@esperantotech.com, sam....@gmail.com, tommy...@esperantotech.com, ces...@cesarb.eti.br
Hi,

I will add my (probably, verging on certainly) stupid solution here.

It would be possible to construct 48bit instructions for BE operations LD/ST, that are a combination of:
1. 16 bit prefix indicating a 48bit BE instruction,
2. 32 bit for a normal LE LD/ST instruction.

The implementation could be simple enough, with minimal changes to instruction decoders.
The first part would set the load/store unit into big endian mode for the following transfer.

Defining it as a 48bit instruction is some kind of explicit fusing.
Context switching would not be able to affect proper execution.

On small 32bit systems with 32 bit instruction fetch,
the 16 and 32 bit parts could be executed as two separate instructions
with interrupts disabled between them.

Code density and execution time should not be impacted too much.
If C extension LD/ST instructions could be used for the second part,
code density would be even higher, but I could not think of a clean solution yet.

If only LD/ST operations are transformed into BE counterparts (relatively easy to implement and document),
then this scheme would not implement BSWAP.

We should probably not rush into the 48bit instruction space,
but to me this seems better than fusing an existing instruction with a new instruction (BSWAP).

Regards,
Iztok Jeras

Luke Kenneth Casson Leighton

unread,
Jun 13, 2018, 8:59:22 AM6/13/18
to Iztok Jeras, RISC-V ISA Dev, Allen Baum, Samuel Falvo II, Tommy Thorn, Cesar Eduardo Barros
On Wed, Jun 13, 2018 at 12:45 PM, Iztok Jeras <iztok...@gmail.com> wrote:
> Hi,
>
> I will add my stupid (probably, towards certainly) solution here.
>
> It would be possible to construct 48bit instructions for BE operations
> LD/ST, that are a combination of:
> 1. 16 bit prefix indicating a 48bit BE instruction,
> 2. 32 bit for a normal LE LD/ST instruction.

hmmm, ok let's think that through (someone correct me if this is an
incorrect analysis)

* a 16-bit prefix would imply that the "C" (compressed) encoding
would have to be used for the purpose.
* why "C"? because it's the only way to discern/distinguish 16-bit
from 32-bit, 48-bit or 64-bit encodings
* if C is not enabled, you're hosed because there's no decode-engine
in place to *decode* 16-bit. you could add one, conceivably, just to
recognise the (proposed) 16-bit prefix
* if C is enabled, it's essential to not clash *with* C, therefore
the proposed 16-bit prefix would have to *be* one of the (extremely
precious, last few remaining) C-encodings.

it would have to be something really *really* important, to take up
one of the last few C encoding spaces, saving N% of
some-resource-or-other, e.g. save 5% or greater instruction cache
usage on big-endian systems, something like that.

C is designed to compact instructions so that, by programs being
smaller, less resources are needed. i noticed in the spec V2.3-Draft
this excerpt:

"....The philosophy of RVC is to reduce code size for embedded
applications and to improve performance and energy-efficiency for all
applications due to fewer misses in the instruction cache. Waterman
shows that RVC fetches 25%-30% fewer instruction bits, which reduces
instruction cache misses by 20%-25%, or roughly the same performance
impact as doubling the instruction cache size [35]."

so anything that goes into C really does need to meet/match or exceed
those expectations... based on looking at *other potential uses* and
seeing which one(s) are best *not* just this particular one.

bottom line is, it's a good idea, that will need some serious, serious
numbers to justify *considering* it as a candidate. then i would
recommend *waiting* for some considerable time in case there are other
(better) candidates for using a C-encoding opcode space, *then*
evaluate the candidates side-by-side to see which is best.

l.

Clifford Wolf

unread,
Jun 13, 2018, 9:02:26 AM6/13/18
to iztok...@gmail.com, RISC-V ISA Dev, Allen Baum, Samuel Falvo II, Tommy Thorn, Cesar Eduardo Barros
Hi,

On Wed, Jun 13, 2018 at 1:45 PM Iztok Jeras <iztok...@gmail.com> wrote:
I will add my stupid (probably, towards certainly) solution here.

It would be possible to construct 48bit instructions for BE operations LD/ST, that are a combination of:
1. 16 bit prefix indicating a 48bit BE instruction,
2. 32 bit for a normal LE LD/ST instruction.

The implementation could be simple enough, with minimal changes to instruction decoders.
The first part would set the load/store unit into big endian mode for the following transfer.

What would be the advantage of that, other than that we can put off defining 48-bit instruction formats for a bit longer?

If it is two instructions architecturally, then we would need to specify what happens if the 16 bit prefix is used, but is not followed by a load/store instruction. Or what happens if the 16 bit prefix is the last data at the end of a page, and the following ld/st instruction is the first instruction on the next page, but accessing that next page produces a TLB miss. That likely means that the status flag that indicates that the next instruction is big endian would be something that can be saved and restored by an interrupt handler. (This is similar to the +/- 2GB interrupt addressing mode proposed by the fast interrupt group. But in this case I don't see the necessity to do it this way.)

If it is one 48 bit instruction architecturally then this complicates decoding and it would be more efficient to use a proper 48-bit instruction instead.

regards,
 - Clifford

Iztok Jeras

unread,
Jun 13, 2018, 10:52:44 AM6/13/18
to RISC-V ISA Dev, iztok...@gmail.com, allen...@esperantotech.com, sam....@gmail.com, tommy...@esperantotech.com, ces...@cesarb.eti.br
I meant actual 48 bit instructions not a pair of a 16 and a 32 bit instruction.
48bit instruction solutions were mentioned before, but without any encoding details.

The 16 bit prefix would be the LSB part of a normal 48bit instruction:
xxxxxxxx_xxxxxxxx__xxxxxxxx_xx011111
Another way to put it: the 48 bit instruction would be a wrapper around a 32bit instruction.

The described structure (using the same encoding as the 32bit LD/ST instructions) was only intended to simplify implementations.
The encoding I proposed is probably the simplest possible.

Only if more compact code were desired could part of the 48bit instruction space be reassigned to 32bit instructions;
then the C extension LD/ST instructions could be wrapped with a 16 bit prefix.

Regards,
Iztok Jeras

Luke Kenneth Casson Leighton

unread,
Jun 13, 2018, 12:39:45 PM6/13/18
to Iztok Jeras, RISC-V ISA Dev, Allen Baum, Samuel Falvo II, Tommy Thorn, Cesar Eduardo Barros
On Wed, Jun 13, 2018 at 3:52 PM, Iztok Jeras <iztok...@gmail.com> wrote:

> I meant actual 48 bit instructions not a pair of a 16 and a 32 bit
> instruction.
> 48bit instruction solutions were mentioned before, but without any encoding
> details.
>
> The 16 bit prefix would be the LSB part of a normal 48bit instruction.
> xxxxxxxx_xxxxxxxx__xxxxxxxx_xx011111
> Another way to put it the 48 bit instruction would be a wrapper around a
> 32bit instruction.

ok... so that would mean that the demuxer logic for
instruction-encoding formats, which has been carefully designed and
conforms to some specific formats so as to reduce the complexity of
the instruction decode phase, thus reducing latency so that that
particular pipeline stage stands a chance of being optimal... would
need to be special-cased to add in special support for a hybrid 16-32
format.

... can you see where that's going? :)

l.

Iztok Jeras

unread,
Jun 13, 2018, 3:50:58 PM6/13/18
to RISC-V ISA Dev, iztok...@gmail.com, allen...@esperantotech.com, sam....@gmail.com, tommy...@esperantotech.com, ces...@cesarb.eti.br
Sorry, I do not see your point.

As an example, imagine how C extension instructions are currently expanded (see the pulp zero-riscy compressed decoder).
A few LSB bits of the instruction are used to detect if the instruction is 32 or 16 bit long.
If the instruction is 16 bit long then it is expanded to a 32bit instruction and fed into the normal instruction decoder.

Similarly for the proposed 48 bit instructions few LSB bits would be used to detect a 48bit instruction for big endian accesses.
Bits [47:16] can then be passed unmodified (no extra logic) into the normal instruction decoder.

The added complexity due to unaligned accesses would be similar to that of the C extension.

Detecting illegal instructions would become more complex.

Regards,
Iztok Jeras
 

l.

Luke Kenneth Casson Leighton

unread,
Jun 13, 2018, 4:05:04 PM6/13/18
to Iztok Jeras, RISC-V ISA Dev, Allen Baum, Samuel Falvo II, Tommy Thorn, Cesar Eduardo Barros
On Wed, Jun 13, 2018 at 8:50 PM, Iztok Jeras <iztok...@gmail.com> wrote:

> Sorry, I do not see your point.
>
> As an example imagine how currently C extension are expanded (see pulp
> zero-riscy compressed decoder).

i read the C-ext spec V2.3-Draft yesterday (to review the C.MV
stuff), so i remember bits of it clearly. the preamble says that the
encoding of C is very very specifically designed so that the muxer can
be implemented with the absolute minimum of latency.

immediate values for example are spread all over the place, in
different bits, because it reduces latency by keeping the mux-decoder
as simple as possible, i.e. *not* stepping outside the uniformity of
the 4 types of C-ext instructions.

i would hazard a guess that the same kinds of rules apply to the
48-bit encoding as to the 16-bit. making the muxer more complex by
having "special cases" such as the one that you are proposing would, i
suspect, require careful evaluation.

basically it would be much simpler and less complexity of a critical
pipeline phase to just conform to the standard 48-bit encoding format
than it would to add *another* special encoding format.

for this reason i haven't made any suggestions of the type that you
made, despite seeing the benefit of the general idea, which i like.

l.

Cesar Eduardo Barros

unread,
Jun 13, 2018, 6:19:40 PM6/13/18
to Allen Baum, Samuel Falvo II, Tommy Thorn, RISC-V ISA Dev
Em 12-06-2018 22:41, Cesar Eduardo Barros escreveu:
> Em 08-06-2018 18:50, Allen Baum escreveu:
>> So 4 possible levels of support:
>>   - none
>>   - swap instructions
>>   - BigEndian load/store mode
>>   - BigEndian load/store instructions
>
> I'm a bit late to this discussion, so instead of replying to several
> subthreads, I'll try to summarize here several possibilities, many I've
> seen in this discussion and a few of my own. I apologize in advance for
> the length of this email message.
>
> I don't currently have an opinion on which is the best one, so I'll try
> to list the pros and cons of each one (help me find more). I'll also
> give each possibility a number to simplify further discussion.
>
> These options are not all exclusive, for instance we could have both 3a
> and 4b (in this case, 3a does the opposite of whichever the current mode
> is).
>
> These options are all about data access. RISC-V code is always in
> little-endian order. An alternative would be a "middle endian" order in
> which each 16-bit parcel is in big endian order, but to keep our sanity
> let's pretend that this possibility doesn't exist and that I didn't
> mention it.

So I thought about it a bit more, and I think there are two distinct use
cases for big-endian instructions on RISC-V:

- Porting software originally developed for a big-endian ISA
- Occasional use of data with the opposite endianness

The ideal solution for one of these use cases will not necessarily be
the ideal solution for the other one. Besides that, they're actually
orthogonal: one might be porting software originally developed for a
big-endian ISA which sometimes has to deal with little-endian data.

For the first use case, having big-endian accesses be slower or larger
will not be acceptable; however, making little-endian accesses slower is
acceptable, since nearly everything will be big-endian. Therefore, only
options 4b (global per-privilege flag making everything big-endian) and
3a (a full set of big-endian load and store instructions) are a good match.

However, option 3a is even more costly than I had thought: there are not
only normal loads and stores, but also LR/SC, atomic operations, and
floating-point loads and stores. Either the opcode space cost of this
option becomes too large, or big-endian is still a second-class citizen.

This leaves us with option 4b (set-and-forget "everything on this
privilege level is big-endian" flag; of course, the instruction fetch is
still little-endian.)

For the second use case, there are more options, since having the code
be a bit bigger or a bit slower when doing the opposite-endian access is
acceptable. My current favorite is either option 1a (a single
full-register BSWAP instruction, to be fused with a SRAI for smaller
swaps) or option 2a (a set of BSWAP instructions, one for each size,
macro-op fused with normal loads and stores; the half-word BSWAP might
be a pseudoinstruction mapping to a half-word rotate with immediate 8).

I don't know which of these two would be the best one. Option 1a is more
elegant, option 2a is faster.

Luke Kenneth Casson Leighton

unread,
Jun 13, 2018, 11:14:10 PM6/13/18
to Iztok Jeras, RISC-V ISA Dev, Allen Baum, Samuel Falvo II, Tommy Thorn, Cesar Eduardo Barros
On Wed, Jun 13, 2018 at 9:04 PM, Luke Kenneth Casson Leighton
<lk...@lkcl.net> wrote:

> i would hazard a guess that the same kinds of rules apply to the
> 48-bit encoding as to the 16-bit. making the muxer more complex by
> having "special cases" such as the one that you are proposing would, i
> suspectt, require careful evaluation.

correction, apologies, iztok, i realised i was probably talking
nonsense here as there's not been any standards-defined 48-bit
encoding so the argument doesn't apply. let me start again.

what *does* apply is as follows (see table 24.1 of V2.3-Draft ISA manual):

what you are proposing is effectively to use a 48-bit encoding to put
a full 32-bit instruction into it. encoding-within-an-encoding:

* a 48-bit encoding "prefix" takes up space to declare and
* a 32-bit encoding "prefix" takes up space to declare.

now, looking at the table it says that to get a full 32-bit of freedom
this is a "minor opcode in 48-bit space" and that there are 2^10
(1024) of these available.

that would therefore seem to make the suggestion you make be perfectly
reasonable.

l.

Christoph Hellwig

unread,
Jun 14, 2018, 5:04:33 AM6/14/18
to Cesar Eduardo Barros, Allen Baum, Samuel Falvo II, Tommy Thorn, RISC-V ISA Dev
On Wed, Jun 13, 2018 at 07:19:31PM -0300, Cesar Eduardo Barros wrote:
> cases for big-endian instructions on RISC-V:
>
> - Porting software originally developed for a big-endian ISA
> - Occasional use of data with the opposite endianness

Note that occasional above might also be frequent.

A lot of network protocols or on-disk formats are big endian, and we
deal with them all the time, at least in systems-level software.

Otherwise I agree with your writeup.

> For the second use case, there are more options, since having the code be a
> bit bigger or a bit slower when doing the opposite-endian access is
> acceptable. My current favorite is either option 1a (a single full-register
> BSWAP instruction, to be fused with a SRAI for smaller swaps) or option 2a
> (a set of BSWAP instructions, one for each size, macro-op fused with normal
> loads and stores; the half-word BSWAP might be a pseudoinstruction mapping
> to a half-word rotate with immediate 8).

Note that full register swaps might often not be the common case. Both
network and file system code are traditionally heavy on 16-bit and 32-bit
values, while I suspect most RISC-V general purpose CPUs will mostly run
with 64-bit register width.

For example, here are the data structures of a very common Linux file
system; look for the __be* types, which do what the names imply -
they are big endian values of 16/32/64-bit width:

https://github.com/torvalds/linux/blob/master/fs/xfs/libxfs/xfs_da_format.h

https://github.com/torvalds/linux/blob/master/fs/xfs/libxfs/xfs_format.h
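
To show the access pattern in userspace terms (glibc <endian.h>; the struct below is made up for illustration, it is not an actual XFS structure):

    #include <stdint.h>
    #include <endian.h>   /* be16toh / be32toh (glibc) */

    /* illustrative on-disk header with big-endian fields, in the style
       of the __be* fields in the headers linked above                  */
    struct demo_ondisk_header {
        uint32_t magic;      /* stored big-endian on disk */
        uint16_t version;    /* stored big-endian on disk */
        uint16_t flags;      /* stored big-endian on disk */
    };

    static inline uint32_t demo_magic(const struct demo_ondisk_header *h)
    {
        return be32toh(h->magic);     /* 16- and 32-bit conversions dominate */
    }

    static inline uint16_t demo_version(const struct demo_ondisk_header *h)
    {
        return be16toh(h->version);
    }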

Michael Chapman

unread,
Jun 14, 2018, 7:58:06 AM6/14/18
to Christoph Hellwig, Cesar Eduardo Barros, Allen Baum, Samuel Falvo II, Tommy Thorn, RISC-V ISA Dev
The way the __be* and __le* data types are used in the Linux kernel, and
the definition of the various byte swap functions in swab.h (and its
architecture-dependent variants), would suggest that a byte swap
instruction on a full register would be the way to go, that there is
little point in byte-swapping load/store instructions or big-endian
memory regions, and that there is very little to be gained overall by
having 16-bit or 32-bit byte swaps.

I think a single full-register byte swap instruction will be plenty
efficient for all general-purpose RISC-V CPUs.

Gavin Stark

unread,
Jun 14, 2018, 9:45:39 AM6/14/18
to Michael Chapman, Christoph Hellwig, Cesar Eduardo Barros, Allen Baum, Samuel Falvo II, Tommy Thorn, RISC-V ISA Dev
I applaud looking at real software to help to see the benefits of alternatives.

Along these lines, and tying in with the comments from many about networking and how that is sometimes handled in an endian-aware manner: the Linux kernel is making more and more use of eBPF, particularly for networking. This includes software-defined networking, virtualised network interfaces, firewalling, NAT, etc.

eBPF in the kernel is supported through JIT compilation from eBPF virtual code to native instructions; for x86 this is all little-endian, of course, and indeed eBPF code is defined to be the endianness of the host it is on.

The point being, that if folks are interested in the benefits of various options for managing endianness, particularly with regard to networking and Linux (and this should be part of any quantification effort for new instructions) then I would encourage looking at source code for eBPF applications. Note further that a good chunk of the source for this is in C, and is compiled with LLVM using the ebpf target.

—Gavin

Allen Baum

unread,
Jun 14, 2018, 10:03:48 AM6/14/18
to Gavin Stark, Michael Chapman, Christoph Hellwig, Cesar Eduardo Barros, Samuel Falvo II, Tommy Thorn, RISC-V ISA Dev
Going back to the original email that spawned this: this was a reaction to comments from (more than one) person that indicated that the most popular microcontroller in Japan was Big Endian, and that a LOT of existing apps were written for them- and the only porting of them that could be expected was a recompilation for RiscV.
I would not assume RV64 will be more popular than RV32 for those apps- I just don't know.
It also isn't clear to me that ultimate performance is a goal- it just mustn't suck badly, but I could be wrong ( and just running on a newer architecture+process might be enough).
Since there are libraries for network code on little-endian architectures, they don't have to be rewritten, at least.

I don't know the prevalence of signed quantities in this code. If rare, the problem is easier.

The byte swap + byte/halfword/word/doubleword-sized option sounds like the best overall choice to me. Macro-fusing could make it even more appealing for architectures that need that extra little bit of performance.

-Allen

Luke Kenneth Casson Leighton

unread,
Jun 14, 2018, 10:38:56 AM6/14/18
to Allen Baum, Gavin Stark, Michael Chapman, Christoph Hellwig, Cesar Eduardo Barros, Samuel Falvo II, Tommy Thorn, RISC-V ISA Dev
On Thu, Jun 14, 2018 at 3:03 PM, Allen Baum
<allen...@esperantotech.com> wrote:

> Going back to the original email that spawned this: this was a reaction to comments from (more than one) person that indicated that the most popular microcontroller in Japan was Big Endian, and that a LOT of existing apps were written for them- and the only porting of them that could be expected was a recompilation for RiscV.

there's also a mission-critical scenario in india (nuclear reactors
running VMEbus 68020 systems) where despite the code being very small
they absolutely will not rewrite it, period, due to it being a
compound-change (change in code, change in hardware) and that is
absolutely impossible to safely validate.

so that we do not end up with radioactive dust getting into our
planet's atmosphere, the whole system has to be as near-identical
to the 68020 VME system it is replacing as it is possible to get.
a FlexBus interface covers VME: beyond that i honestly don't know
if they'd be happy with software-emulation or require strict
single-cycle hardware support for LD.BE/ST.BE - i would be
extremely wary of anything that wasn't as close as possible to
what the 68020 used to do.

l.