Bit Manipulation (decoding immediate)

Michael Clark

unread,

Mar 12, 2016, 2:28:38 AM3/12/16

to RISC-V ISA Specification Discussion

Hi All,

Sharing DRAFT notes on bit manipulation; provoked by a polymorphic software decoder for RISC-V. After looking at the Bit Manipulation section of the ISA Specification I became interested in instructions that could accelerate software decoding of RISC-V instructions, such as would be required in an illegal instruction handler or a simulator / binary translator. e.g. RISC-V on RISC-V.

~mc

Hypothetical Bit Manipulation Instructions

* bclz - Count Leading Zeros (w,d,q) Count leading zero bits
* bctz - Count Trailing Zeros (w,d,q) Count trailing zero bits
* bcsp - Count Set Population (w,d,q) Count set population
* bsrm - Shift Right and Mask (w,d,q) C expression: (src >> start) & ((1 << masklen)-1)
* bglb - Get Lowest Bit (w,d,q) C expression: x & -x
* bclb - Clear Lowest Bit (w,d,q) C expression: x & (x - 1)
* bmlb - Mask to Lowest Bit (w,d,q) C expression: x ^ (x - 1)
* bgs - Gather Scatter (w,d,q) Gather and scatter bits

RISC-V Instruction Decoding

Bit Manipulation Instructions would accelerate software decoding of RISC-V instructions. Specifically a bit gather scatter instruction that could be used in illegal instruction exception handlers for faster instruction emulation. Based on some analysis of software-based RISC-V instruction decoding I believe we could benefit from two new instructions:

* bsrm - Shift Right and Mask
* bgs - Gather Scatter

Notes on instruction decoding

* Registers can be extracted with a shift and mask instruction (src >> start) & ((1 << masklen)-1)
* Immediate decoding would benefit from a bit gather scatter instruction
* Typical RISC-V immediate is 16 bits or less
* Atypical immediate is 21 bits (UJ-type)

Notes on bit offset packing for bit gather scatter

* 16, 32 and 64 bits requires 4, 5 and 6 bits to express a bit offset respectively
* 32, 64 and 128 bit words can contain 8, 16 and 32 x 4 bit offsets respectively
* 64-bit word (or 2 x 32 bit regs) can contain 12 x 5 bit offsets (ugly)
* 64-bit word (or 2 x 32 bit regs) can contain 10 x 6 bit offsets (ugly)
* 128-bit word (or 2 x 64 bit regs) can contain 25 x 5 bit offsets (ugly)
* 128-bit word (or 2 x 64 bit regs) can contain 21 x 6 bit offsets (ugly)

Bit Gather Scatter

BGS (Bit Gather Scatter) uses 4 bit packed offsets in a gather scatter mask register (rs2) to gather bits from a specified half or quarter of a source register (rs1) and write bits to a half or quarter of a destination register (rd). Gather bit offsets in the source register are based on the right to left position of the 4 bit packed offset relative to the source half or quarter. Scatter bit offsets in the destination register are based on the packed bit offset value relative to the destination half or quarter.

# RV32B
bgs.w
32-bit word can contain 8 x 4 bit offsets (half 32-bit word)
rd - bit gather scatter destination
rs1 - bit gather scatter source
rs2 - bit gather scatter mask
1 funct bits for source half
1 funct bits for destination half

# RV64B
bgs.d
64-bit word can contain 16 x 4 bit offsets (quarter 64-bit double)
rd - bit gather scatter destination
rs1 - bit gather scatter source
rs2 - bit gather scatter mask
2 funct bits for source quarter
2 funct bits for destination quarter

signature.asc

Samuel Falvo II

unread,

Mar 12, 2016, 12:32:31 PM3/12/16

to Michael Clark, RISC-V ISA Specification Discussion

On Fri, Mar 11, 2016 at 11:28 PM, Michael Clark <michae...@mac.com> wrote:
> Hypothetical Bit Manipulation Instructions
>
> * bclz - Count Leading Zeros (w,d,q) Count leading zero bits
> * bctz - Count Trailing Zeros (w,d,q) Count trailing zero bits
> * bcsp - Count Set Population (w,d,q) Count set population
> * bsrm - Shift Right and Mask (w,d,q) C expression: (src >> start) & ((1 << masklen)-1)
> * bglb - Get Lowest Bit (w,d,q) C expression: x & -x
> * bclb - Clear Lowest Bit (w,d,q) C expression: x & (x - 1)
> * bmlb - Mask to Lowest Bit (w,d,q) C expression: x ^ (x - 1)
> * bgs - Gather Scatter (w,d,q) Gather and scatter bits

Considering how often AND NOT appears in the logic of software,
particularly when manipulating graphics, clearing specific device
control register bits, or using bit masks to implement a set datatype
(e.g., Oberon or Modula-2's SET type), I think it should be a
consideration to include an "and not" instruction. Otherwise, a
processor implementation that targets these areas will need to
consider interpreting XORI rA, rA, -1; AND rD, rB, rA as a bonded or
macro instruction to execute in a single cycle.

> BGS (Bit Gather Scatter) uses 4 bit packed offsets in a gather scatter mask register (rs2) to gather bits from a specified half or quarter of a source register (rs1) and write bits to a half or quarter of a destination register (rd). Gather bit offsets in the source register are based on the right to left position of the 4 bit packed offset relative to the source half or quarter. Scatter bit offsets in the destination register are based on the packed bit offset value relative to the destination half or quarter.

I've read through this many times, and I'm still thoroughly confused
by your scatter/gather notes. Can you discuss the specific problem
you're trying to solve, and how your proposed solution aims to help?
Can you include example code to help illustrate?

--
Samuel A. Falvo II

Mark Friedenbach

unread,

Mar 12, 2016, 12:37:21 PM3/12/16

to Michael Clark, RISC-V ISA Specification Discussion

As a general note about the bit extension ISA, which I am very interested in: it will be made less contentious, and therefore more easily and more quickly adopted if we focus on just those instructions which do not have efficient decompositions into a small number of base instructions.

For example counting leading / trailing zeroes and popcount both require a loop (possibly unrolled) to emulate, whereas the others you enumerate are somewhat trivially decomposed, as your C equivalent examples show.

That said, you present a convincing use case for gather scatter -- instruction decode on emulation traps should be very fast and is probably worth adding an instruction, even if it is decomposable...

On Fri, Mar 11, 2016 at 11:28 PM, Michael Clark <michae...@mac.com> wrote:

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/D8795FF3-462F-4E17-9B36-57F6CDA38919%40mac.com.

Tommy Thorn

unread,

Mar 12, 2016, 1:15:47 PM3/12/16

to RISC-V ISA Specification Discussion

One of the wonderful qualities of the RISC-V ISA is the lack of cruft; only
instructions that has quantitive merits are included. As the B-set isn’t
a mandated feature, to maximize the chance of getting it adopted by
implementations, it must have a compelling benefit/cost ratio.

It was in that spirit we last discussed the B-set and I proposed a B-lite
set. Only instructions that cannot be implement in ~ 4 of fewer instructions
were included:

The three Alpha 21264A (EV67) instructions

Mnemonic Instruction
CTLZ Count Leading Zero
CTPOP Count Population
CTTZ Count Trailing Zero

as well as

BSWAP Swap the byte in a word.

Note, there no 16-bit version as it’s only three instructions.

BSWAP is important for networking code and all code that has to convert bigendian
data and I can’t imagine this to be controversial.

The bit counting instructions are critical in many memory allocations routines and
data structures. Fx. the Linux kernel has many instances of it. Still I owe some
hard data on real-world impact.

Thanks
Tommy

Michael Clark

unread,

Mar 12, 2016, 2:34:07 PM3/12/16

to Samuel Falvo II, RISC-V ISA Specification Discussion

On 13 Mar 2016, at 6:32 AM, Samuel Falvo II <sam....@gmail.com> wrote:

BGS (Bit Gather Scatter) uses 4 bit packed offsets in a gather scatter mask register (rs2) to gather bits from a specified half or quarter of a source register (rs1) and write bits to a half or quarter of a destination register (rd). Gather bit offsets in the source register are based on the right to left position of the 4 bit packed offset relative to the source half or quarter. Scatter bit offsets in the destination register are based on the packed bit offset value relative to the destination half or quarter.

I've read through this many times, and I'm still thoroughly confused
by your scatter/gather notes. Can you discuss the specific problem
you're trying to solve, and how your proposed solution aims to help?
Can you include example code to help illustrate?

I think a diagram is needed to make it easier to understand, especially showing each half of a 32-bit word and each quarter of a 64-bit word:

* halves: (h1, h0) = [31:16], [15:0]

* quarters: (q3, q2, q1, q0) = [63:48], [47:32], [31:16], [15:0]

I have made two examples of decoding the immediate for CJ-Type, one using Bit Gather Scatter and one using RV64I. The gather scatter data below references the lower 16-bits (15:0) of a 64-bit register containing an instruction to decode, or q0 (quarter 0) in the hypothetical instruction notation. I realise the notation for halves and quarters needs to be more clearly worded (and could perhaps be indicated differently). Likewise the semantics would need to be defined for the case where the same scatter bit is referenced multiple times as is bit 0 in the example.

These are the two examples:

* decode_cj_bgs - 7 instructions plus one constant

* decode_cj_rv64i - 26 instructions generated by gcc

Here is the assembler:

cj_data:

# define 16-bits gather scatter data for CJ-Type (12:2) offset[11|4|9:8|10|6|7|3:1|5]

.double 0x000B498A67321500

decode_cj_bgs:

# load address of gather scatter data

auipc x1,cj_data(hi)

addi x1,x1,cj_data(lo)

# load gather scatter data to x2

ld x2,0(x1)

# clear destination

li x3,0

# gather from lower quarter (15:0) of x0, scatter to lower quarter (15:0) of x3

bgs.d.q0.q0 x3,x0,x2

andi x3,0b111111111110 # remove bit 0 dumping ground

ret

decode_cj_rv64i:

srl a4,a0,7
and a3,a4,16
li a4,4096
srl a5,a0,1
add a4,a4,-2048
and a4,a5,a4
and a2,a5,768
or a4,a3,a4
all a3,a0,2
or a4,a4,a2
and a3,a3,1024
or a4,a4,a3
and a5,a5,64
sll a3,a0,1
or a5,a4,a5
and a3,a3,128
srl a4,a0,2
or a5,a5,a3
and a4,a4,14
sll a0,a0,3
or a5,a5,a4
and a0,a0,32
or a0,a5,a0
sll a0,a0,52
sra a0,a0,52
ret

signature.asc

Michael Clark

unread,

Mar 12, 2016, 2:42:35 PM3/12/16

to Tommy Thorn, RISC-V ISA Specification Discussion

Maybe this is a better list then:

CTLZ - Count Leading Zeros (w,d,q) Count leading zero bits

CTTZ - Count Trailing Zeros (w,d,q) Count trailing zero bits

CTPOP - Count Population (w,d,q) Count set population

SRLM - Shift Right Logical and Mask (w,d,q) C expression: (src >> start) & ((1 << masklen)-1)

BGS - Gather Scatter (w,d,q) Gather and scatter bits

I have included the SRLM as I think it may be complex enough on a single issue design such that it would represent a reasonable saving of cycles. There may be a benefit from embedding the two immediates. We would have to write the asm to see how much of a win it is.

signature.asc

Michael Clark

unread,

Mar 12, 2016, 2:48:52 PM3/12/16

to Tommy Thorn, RISC-V ISA Specification Discussion

Forgot BSWAP

CTLZ - Count Leading Zeros (w,d,q) Count leading zero bits

CTTZ - Count Trailing Zeros (w,d,q) Count trailing zero bits

CTPOP - Count Population (w,d,q) Count set population

SRLM - Shift Right Logical and Mask (w,d,q) C expression: (src >> start) & ((1 << masklen)-1)

BGS - Bit Gather Scatter (w,d,q) Gather and scatter bits

BSWAP - Byte Swap (w,d,q) Swap the bytes in a word.

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/640902BD-45D9-499C-A289-EE90F8787B75%40mac.com.

signature.asc

Mike Hamburg

unread,

Mar 12, 2016, 3:07:11 PM3/12/16

to Michael Clark, Tommy Thorn, RISC-V ISA Specification Discussion

Is rotate insufficiently orthogonal? It’s used a lot in crypto, but maybe not in anything else?

I’m also confused about SRLM. Is this to avoid a shift right followed by and/andi?

— Mike

To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/AED4B26D-2BCE-40C7-B2A6-2018D7C52626%40mac.com.

Michael Clark

unread,

Mar 12, 2016, 3:50:28 PM3/12/16

to Mike Hamburg, Tommy Thorn, RISC-V ISA Specification Discussion

On 13 Mar 2016, at 9:07 AM, Mike Hamburg <mi...@shiftleft.org> wrote:

Is rotate insufficiently orthogonal? It’s used a lot in crypto, but maybe not in anything else?

It’s not emitted by C compilers. It might be useful in conjunction with BGS e.g. allow elimination of the half and quarter at the expense of an additional instruction to shift bits into the low [15:0] bits.

uint32_t rotl32 (uint32_t value, unsigned int count) {

return (value<<count) | (value>>( (-count) & 0b11111 ));

}

uint32_t rotr32 (uint32_t value, unsigned int count) {

return (value>>count) | (value<<( (-count) & 0b11111 ));

}

I’m also confused about SRLM. Is this to avoid a shift right followed by and/andi?

Yes you are right and it’s not worth it on reflection. Likewise we could make do with a single rotate if it was to be added.

CTLZ - Count Leading Zeros (w,d,q) Count leading zero bits

CTTZ - Count Trailing Zeros (w,d,q) Count trailing zero bits

CTPOP - Count Population (w,d,q) Count set population

BGS - Bit Gather Scatter (w,d,q) Gather and scatter bits

BSWAP - Byte Swap (w,d,q) Swap the bytes in a word

RRL - Rotate Right Logical (w,d,q) Rotate the bits in a word

RRLI - Rotate Right Logical Immediate (w,d,q)

I will do some more research on Bit Gather Scatter (*1)

[1] http://palms.ee.princeton.edu/system/files/Hilewitz_JSPS_08.pdf

Alex Song

unread,

Dec 27, 2016, 11:26:28 PM12/27/16

to RISC-V ISA Dev

Hate to be a bother, but I was curious if there has been any movement on this behind the scenes.

-AS

Michael Clark

unread,

Dec 27, 2016, 11:42:14 PM12/27/16

to Alex Song, RISC-V ISA Dev

I should perhaps spend more time on Chisel/Verilog. I can learn :-D

I compiled a list:

- https://github.com/michaeljclark/riscv-meta/blob/master/doc/src/bmi.md

Instructions that encode to less than 4 instructions have been removed as discussed. Many of the simple C expressions below have been removed. Rotate is relatively simple but very common in hash algorithms and ciphers so it is worth keeping. I've also listed instructions that would aid in decode of RISC-V instructions. BITEXTEND allows arbitrary sign or zero extension and it takes a surprising amount of machine code. I haven’t seen it in any BMI set. It can be used to zero or sign extend. I have software versions for all except parallel bit extract and parallel bit deposit, but no opcodes…

I should load the most recent lowRISC update onto my KC705 and could try some HDL hacking perhaps.

~mc

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/61d74fa3-a18d-4c6c-ae09-64c4e31f0ec6%40groups.riscv.org.

Samuel Falvo II

unread,

Dec 28, 2016, 12:14:06 AM12/28/16

to Michael Clark, Alex Song, RISC-V ISA Dev

On Tue, Dec 27, 2016 at 8:42 PM, Michael Clark <michae...@mac.com> wrote:
> worth keeping. I've also listed instructions that would aid in decode of
> RISC-V instructions.

It's a pity that PBE and PBD are only of limited value for instruction
unpacking, as the UJ and SB displacements are not contiguous in their
bit allotments. SB can be fixed easily enough with a table-lookup,
but UJ will require some more massaging. Have you thought about how
PBE can be used for decoding a UJ dislpacement?

Michael Clark

unread,

Dec 28, 2016, 12:44:06 AM12/28/16

to Samuel Falvo II, Alex Song, RISC-V ISA Dev

Yes Parallel Extract and Deposit instructions are not optimal for RISC-V instruction decoding/encoding due to more than one crossing, however they will certainly reduce the instruction count. I should quantify this. Towers of Hanoi. It's not that the bits are not contiguous (as this is intrinsically handled) but it's the number of twists. I think there is at most two passes to extract the immediate in the Base ISA.

The Princeton paper talks about a method using butterfly networks for handling arbitrary bit permutations but it seems to be very complex, and requires quite specialist programming to make use of it so I didn't include it.

Sent from my iPhone

> --
> You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
> To post to this group, send email to isa...@groups.riscv.org.
> Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

> To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAEz%3DsomxL5qB4d13k%3DVY03_mCAzyE14m6757TEm76MkJX7J%2B-Q%40mail.gmail.com.

Michael Clark

unread,

Nov 12, 2017, 4:35:32 PM11/12/17

to Samuel Falvo II, RISC-V ISA Specification Discussion

> On 13/03/2016, at 8:33 AM, Michael Clark <michae...@mac.com> wrote:
>
>> On 13 Mar 2016, at 6:32 AM, Samuel Falvo II <sam....@gmail.com> wrote:
>>
>>> BGS (Bit Gather Scatter) uses 4 bit packed offsets in a gather scatter mask register (rs2) to gather bits from a specified half or quarter of a source register (rs1) and write bits to a half or quarter of a destination register (rd). Gather bit offsets in the source register are based on the right to left position of the 4 bit packed offset relative to the source half or quarter. Scatter bit offsets in the destination register are based on the packed bit offset value relative to the destination half or quarter.
>>
>> I've read through this many times, and I'm still thoroughly confused
>> by your scatter/gather notes. Can you discuss the specific problem
>> you're trying to solve, and how your proposed solution aims to help?
>> Can you include example code to help illustrate?

It turns out the instruction I was trying to describe is appearing in a future Intel processor as part of the AVX512_BITALG extension. Possibly Icelake.

VPSHUFBITQMB — Shuffle Bits from Quadword Elements Using Byte Indexes into Mask

A 512-bit vector contains control word with 64 x 8-bit values (of which only 6 bits are significant, as each element selects a bit in the source data) containing the bit offsets to be gathered from the source register with the result written to a destination register (k register in the case of the Intel instruction).

BGS can do more work than BEXT/BDEP, which requires multiple iterations if bits need to cross over, however BGS requires more control data, hence my description was limited to arbitrary 16-bit permutations.

BGS (or Intel’s VPSHUFBITQMB) as I described it uses 4 bit packed offsets in a 64-bit control word to perform an arbitrary bit permutation for a 16-bit word. The Intel instruction uses a 512-bit control word with 64 6-bit offsets to gather from a 64-bit word. There are also a number of instructions for operating on Galois Fields (GF-NI)

- https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

In any case my example which pre-dates the Intel document shows the use of a 64-bit vector containing 16 - 4-bit offsets for arbitrary bit permutations.

* decode_cj_bgs - 7 instructions plus one constant
* decode_cj_rv64i - 26 instructions generated by gcc

Here is the assembler from my original message:

(reading the RISC-V CJ-Type encoding and looking at the double nibbles below gives you some idea how it works)

signature.asc

Michael Clark

unread,

Nov 12, 2017, 4:42:22 PM11/12/17

to Samuel Falvo II, Clifford Wolf, RISC-V ISA Specification Discussion

We’d need the V extension with a 64 element control vector of bytes containing 6-bit offsets to perform an arbitrary bit permutation on a 64-bit word.

Of course there are more compact encodings as once a bit has been selected, there are less remaining bits that can be described if each bit only appears in the destination word once. Clifford Wolf has done the math for this. A 512-bit (64 x 8bit using only 6bits) vector is quite expensive, however an arbitrary big gather with multiple bit crossings can be specified in one instruction, as per my attempt at a BGS description below (which was working on 16-bit values with 4-bit offsets).

At least we have prior art for VPSHUFBITQMB on the RISC-V mailing list (Mar 2016), which pre-dates Intel’s public document (Oct 2017).

The only difference is control word size.

signature.asc

Message has been deleted

lk...@lkcl.net

unread,

Apr 2, 2018, 8:26:59 AM4/2/18

to RISC-V ISA Dev, tommy...@thorn.ws

On Saturday, March 12, 2016 at 6:15:47 PM UTC, Tommy Thorn wrote:

One of the wonderful qualities of the RISC-V ISA is the lack of cruft; only
instructions that has quantitive merits are included. As the B-set isn’t
a mandated feature, to maximize the chance of getting it adopted by
implementations, it must have a compelling benefit/cost ratio.

ok so in a separate thread, related to 3D Graphics and Video processing, i mentioned an idea for a Simple-Vector extension which boils down to a single instruction: set a CSR which says whether a register becomes a vector or not, such that ALU operations on e.g. register0 are now *simultaneously* executed on register1,2,3,4, whatever the length of the vector is set to. length 1 indicates "back to normal" (hence why it's only a single instruction).

the proposal also said, "*any* ALU instructions where the register is marked as a vector become vectors" - mandatory. no exceptions. and that would also include the B-set.

however i'm now wondering, how the heck would B actually work, in such a vector context?

l.

lk...@lkcl.net

unread,

Apr 2, 2018, 9:51:14 AM4/2/18

to RISC-V ISA Dev, sam....@gmail.com, alex.l...@gmail.com

On Wednesday, December 28, 2016 at 5:44:06 AM UTC, Michael Clark wrote:

The Princeton paper talks about a method using butterfly networks for handling arbitrary bit permutations but it seems to be very complex,

it isn't. i made a study of butterfly networks at imperial college: if you have two mirrored back-to-back you can get *guaranteed* 100% throughput on *all* possible permutations of inputs-to-outputs [as long as no two inputs wish to route to the same output]. router settling time is O(log(N)) or something like that, and it's equivalent to a full arbitrarily-sized crossbar whilst not actually requiring anyhting like the amount of gates *of* a full crossbar.

see the following image if you're not familar with them: https://i.stack.imgur.com/N1vxo.png or look on wikipedia butterfly network page if that image ever disappears.

so its simplicity comes down to the fact that it's a staggered array of 2x2 crossbars. the basic principle of the butterfly network is that you take the source address, XOR it with the destination address, and that is *LITERALLY* - in bit-order - the settings for each 2x2 crossbar in each row. that's it. that's *LITERALLY* all there is to it. if you follow the XORing rule you WILL route source to destination, for all possible combinations of source->destination.

so if you have 3-bit addresses 011 and 101, you will need 3 rows in the butterfly network. you XOR the two adresses together to get 110, then you take bit 0 to set the first router's 2x2 crossbar (which in this case is 0 so it's not crossed), then bit 1 for the second row's 2x2 crossbar, that's crossed in this case and likewise bit 2 is also crossed.

hmmm.... somewhere, someone will know a sort algorithm that fits precisely onto a butterfly network....

https://stackoverflow.com/questions/20042184/butterfly-network-in-sorting

https://www.researchgate.net/publication/266629590_Parallel_Butterfly_Sorting_Algorithm_on_GPU [1]

going back to the original problem of doing arbitrary bit-swapping, the problem that you have is NOT specific to butterfly networks, it's a problem for ALL bit-swapping algorithms: the sheer overwhelming number of bit permutations in say a 32->32... that's... er... it's 32 factorial, isn't it? something like that. and that's insane. only made worse with 64-bit. so whatever is done i *really* do recommend subsetting.

apologies i'm thinking out loud here, i have a feeling that if you were to implement an instruction which effectively implemented a *single row* of any given butterfly network (where you could specify, in the instruction, which row you wanted to "apply" to the bits (errr... see diagram above but mentally turn it... mmm.... clockwise by 90 degrees), you could do a full any-in to any-out of a 32 bit number in ... 0 1 2 3 2 1 0 so 7 instruction cycles. some of those combinations would actually implement full byte-swapping, *and* you could implement arbitrary bit-shifts, albeit requiring several instructions to do them.

interestingly you would only need 16 bits for doing 32-bit bit-swapping because you're setting 2x2 crossbars. soo... in theeorryyyy... you could actually use two halves of one 32-bit register to do *two* simultaneous swappings with just the one instruction. effectively a 4x4 crossbar.

but before exploring that tangent allow me to describe the instruction better, with examples. assume 32-bit data.

bit-swap-me {butterflyrow:0-3} {register to use as source for crossbar-settings:15-0}

(maybe a register variant of this? so *register* to use as butterflyrow, use bits 0-3 of register)

let's say r0=0110110100001111 that's 16 bits right? and we set butterflyrow=0

apply that to source register 00 01 11 01 01 01 11 00 11 01 00 01 11 01 00 01

i've split it by bitpairs deliberately, you'll see why:

so basically because we set brow = 0, the bits (in order) from r0 specify whether to swap *bit-pair's* bits in the source register, and we get:

00 10 11 01 10 10 darn this is hard, trying to point 2 fingers at two places on the screen at once... you get the idea anyway :)

so now let's take the same example and set butterflyrow=1. now that's swapping *nibble* bits... so let's put the source register into nibbles:

0001 1101 0101 1100 1101 0001 1101 0001

and the output would be: take 1st two nibbles and first 4 bits of r0, do swaps

0101 1001

so, expressing that programmatically:

if r0[0] == 1:

swap(o[0], o[4])

swap(o[1], o[5])

swap(o[2], o[6])

swap(o[3], o[7])

if r0[1] == 1:

swap(o[8], o[12])

swap(o[9], o[13])

swap(o[10], o[14])

swap(o[11], o[15])

and so on. likewise when brow=2 the swap-gap jumps to 8, and when brow=3 the swap-gap jumps to 16.

anyway brow=3 is basically where you can emulate full half-word-swapping. top 16 bits interchanged with bottom 16 bits by setting brow=3, r0 = 1111111111111111 - very simple. that says "please swap every bit spaced 16 apart". end result? top half is swapped with bottom half of the 32-bit number.

if you want to do full byte-swapping (any byte to any byte) you do it in 3 instructions, brow=2, brow=3, brow=2. first brow=2 is used to swap bytes 0 and 1, and bytes 2 and 3. then brow=3 is used to swap bytes 0 and 2, and bytes 1 and 3, and finally the third use brow=2 again swaps 0+1 and 2+3.

the only thing is (and this is not a huge down-side, just an observation) there are actually quite a lot of possible permutations, the study that i did concluded that in each row you have *two* possible ways of "routing" bits, because you can either do the swap with the first brow=2 and leave it on the second, *or* you could leave it on the first brow=2 and do the swap on the second.

that *is* where it gets complex, michael, when you start to study a butterfly network in detail. the only reason i understand (and trust) the logic/math is because i studied them for months, back in 1991, and have been sort-of "keeping it in the back of my mind" ever since. normally my brain would be utterly and completely melted by the set theory behind this stuff.

now, like i mentioned already (getting ahead of myself) because you have used 16-bits as the settings on the crossbar [bit-]switches, you *could* use the *second* (top) 16-bits to specify a *second* switch. implementors would be free to choose whether to do the (rather complex as it turns out) analysis of whether to turn that into a 4x4 crossbar, or whether to apply the bit-switches sequentially (2 clock cycles), or whether (much simpler) to just duplicate the 2x2 crossbar logic and run the bits through two bit-swapping-networks back-to-back.

if this idea is to be taken seriously, what i'd suggest is, doing a proper study, writing an algorithm that auto-generates the cross-bar settings for certain commonly-used operations (bear in mind that such an algorithm would need to be part of gcc and any other compiler). arbitrary-offset barrel-shifting, byte-swapping, nibble-swapping, etc. etc. just bearing in mind that worst-case for 32-bit operands it could take up to *seven* instructions for the more in-depth cases. that reduces to 4 if the double-swapping idea is incorporated.

however.... the idea does have one big, big advantage: it takes care of *ALL* and *ANY* possible instructions involving permutational bitwise reordering, and expands conceptually very easily to 64-bit and even 128-bit... all covered with a *single instruction*. and i do mean all of them.

l.

[1] interesting. that GPU paper screwed up. how come i studied this in 1990 and realised that you just have to have two butterfly networks mirrored back-to-back (swap bits spaced by 1 then 2 then 4 then 8 then back to 4 then 2 then 1), and these people in *2013* present an *inefficient* algorithm that does 1 then 2 then 1 then 4 then 2 then 1 then 8 then 4 then 2 then 1, i mean moo? wtf?? *sigh*....

lk...@lkcl.net

unread,

Apr 2, 2018, 10:35:45 AM4/2/18

to RISC-V ISA Dev, sam....@gmail.com, clif...@clifford.at

On Sunday, November 12, 2017 at 9:42:22 PM UTC, Michael Clark wrote:

We’d need the V extension with a 64 element control vector of bytes containing 6-bit offsets to perform an arbitrary bit permutation on a 64-bit word.

whereas the proposed single-butterfly-row bit-swapping instruction (and its proposed two-separate-rows-in-a-single-instruction enhancement/optimisation) would not. in the 64-bit case it would need to be able to specify a butterfly-row index between 0 and 4 (3 bits needed for that), and would require 32-bits of its "crossbar-setting" register-operand in order to swap 2-bit pairs of a 64-bit source.

Of course there are more compact encodings as once a bit has been selected, there are less remaining bits that can be described if each bit only appears in the destination word once. Clifford Wolf has done the math for this.

clifford, michael, anyone else: would anyone be prepared to do a similar analysis for the proposed single-butterfly-row bit-swapping instruction, assuming that the overhead of needing multiple instructions would be acceptable to cover all possible bit-wise permutations (9 instructions in the case of a 64-bit source: row indices 012343210, 7 for 32-bit sources: row indices 0123210, 5 for 16-bit or for arbitrary swapping of nibbles in 64-bit numbers... so many combinations, all covered by a single instruction!)

.... the BGS / VPSHUFBITQMB would also need multiple instructions, right?

A 512-bit (64 x 8bit using only 6bits) vector is quite expensive, however an arbitrary big gather with multiple bit crossings can be specified in one instruction, as per my attempt at a BGS description below (which was working on 16-bit values with 4-bit offsets).

At least we have prior art for VPSHUFBITQMB on the RISC-V mailing list (Mar 2016), which pre-dates Intel’s public document (Oct 2017).

publicly-published implementations of butterfly networks date back (at leasst to my knowledge) to 1983: the ALICE Network (Inmos Transputers... anyone remember those?), a joint project between Manchester University, Plessey, and Imperial College. mentioning this because there's *no way* anyone's getting a valid patent on this one.

l.

lk...@lkcl.net

unread,

Apr 2, 2018, 10:54:22 AM4/2/18

to RISC-V ISA Dev, tommy...@thorn.ws

On Monday, April 2, 2018 at 1:26:59 PM UTC+1, lk...@lkcl.net wrote:

ok so in a separate thread, related to 3D Graphics and Video processing, i mentioned an idea for a Simple-Vector extension which boils down to a single instruction:

cross-reference for archive search/navigation purposes https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/GuukrSjgBH8

Clifford Wolf

unread,

Apr 2, 2018, 11:58:15 AM4/2/18

to lk...@lkcl.net, RISC-V ISA Dev, sam....@gmail.com

Hi,

On Mon, Apr 02, 2018 at 07:35:45AM -0700, lk...@lkcl.net wrote:
> clifford, michael, anyone else: would anyone be prepared to do a similar
> analysis for the proposed single-butterfly-row bit-swapping instruction,
> assuming that the overhead of needing multiple instructions would be
> acceptable to cover all possible bit-wise permutations (9 instructions in
> the case of a 64-bit source: row indices 012343210, 7 for 32-bit sources:
> row indices 0123210, 5 for 16-bit or for arbitrary swapping of nibbles in
> 64-bit numbers... so many combinations, all covered by a single
> instruction!)

When I looked into BEXT/BDEP (PEXT/PDEP) I thought a lot about generic
butterfly opcodes and ultimately discarded the idea for a number of
reasons, but mostly because of the large amount of control words
neccessary to perform an operation. It does not matter if you pack the
entire config into a vector register, put it into special CSRs (as
[hilewitz06] proposes), or load them one at a time into GPRs and use them
with a single-butterfly-row instruction (as is proposed in this thread if I
understand it correctly.)

Assume you want to perform an arbitrary bit swap on a 64 bit word. That
would require a forward butterfly (with log2(64)=6 rows) and a backward
butterfly with another 5 rows. (the last row of the fwd butterfly and
the first row of the inverse butterfly can be shared.)

Using a LUI and an ADDI instruction to load each of those 11 32-bit
constants this means we need 3*11=33 instructions for an arbitrary
permutation. This is about equal to the experimental results I got for
synthesizing sequences of GREVI, ROT, BEXT, and BDEP to implement
arbitrary bit permutations.

Using e.g. memory loads instead of LUI+ADDI will not really change the
performance significantly. It just exchanges icache utilization by dcache
utilization.

Packing the 11 32-bit constants into 6 64-bit registers would reduce the
enormous register pressure a bit, but it would also further complicate the
instruction encoding because now you'd need to be able to specify in the
instruction if you want to use the upper or lower half of the control word.

Another important point to consider is Amdahl's Law. Even an algorithm that
does a lot of bit permutations will only spend a fraction of it's total
runtime in the bit permutation code. So even if single-butterfly-row
instructions would be 30% faster at bit permutations it would not mean that
the program is actually going to be 30% faster. I think in most
applications BEXT/BDEP would provide a sufficient speedup to shift the
performance bottleneck to other issues for most workloads.

I also think that BEXT/BDEP is more useful than single-butterfly-row
instructions in a broader context beyond just general bit permutations.
For example, just a forward and reverse butterfly is insufficient if some
bits should be duplicated and placed in multiple positions in the output
word while other input bits should be discarded.

I'd also like to make the (arguably weak :) argument that PDEP/PEXT made
it into BMI2 while generic single-butterfly-row or generic forward-butterfly
or reverse-butterfly instructions did not make it into BMI2. I would
assume that Intel did consider both options and had good reasons for going
with PDEP/PEXT instead of BFLY/IBFLY.

Finally I think that single-butterfly-row instructions are a really bad
abstraction, at least compared to the relatively straight-forward BEXT/BDEP
instructions. Remember that those instructions would also be used to
implement functions such as bit-field-extract. It is relatively
straight-forward to do that with a single BEXT instruction. But you would
require a bunch of single-butterfly-row instructions to do it and it would
not be obvious to most users how to calculate the butterfly control words
to implement the desired bit-field-extract operation.

> > At least we have prior art for VPSHUFBITQMB on the RISC-V mailing list
> > (Mar 2016), which pre-dates Intel’s public document (Oct 2017).
>
> publicly-published implementations of butterfly networks date back (at
> leasst to my knowledge) to 1983: the ALICE Network (Inmos Transputers...
> anyone remember those?), a joint project between Manchester University,
> Plessey, and Imperial College. mentioning this because there's *no way*
> anyone's getting a valid patent on this one.

Gauss used recursive application of butterfly circuits in 1805. He was
essentially doing Cooley–Tukey radix-2 FFT 160 years before Cooley and
Tukey independently rediscovered it in 1965.

regards,
- clifford

Luke Kenneth Casson Leighton

unread,

Apr 3, 2018, 10:15:11 AM4/3/18

to Clifford Wolf, RISC-V ISA Dev, Samuel Falvo II

On Mon, Apr 2, 2018 at 4:58 PM, Clifford Wolf <clif...@clifford.at> wrote:
> Hi,

>
> When I looked into BEXT/BDEP (PEXT/PDEP) I thought a lot about generic
> butterfly opcodes and ultimately discarded the idea for a number of
> reasons, but mostly because of the large amount of control words
> neccessary to perform an operation. It does not matter if you pack the
> entire config into a vector register, put it into special CSRs (as
> [hilewitz06] proposes), or load them one at a time into GPRs and use them
> with a single-butterfly-row instruction (as is proposed in this thread if I
> understand it correctly.)

hi clifford great to see someone else Gets butterfly networks. yep i
was quite enthusiastic about this initially then realised that the
control words aren't log2(bit_width) length, they're bit_width/2 [2x2
crossbars i.e. each control bit swaps *2* bits each].

> Assume you want to perform an arbitrary bit swap on a 64 bit word. That
> would require a forward butterfly (with log2(64)=6 rows) and a backward
> butterfly with another 5 rows. (the last row of the fwd butterfly and
> the first row of the inverse butterfly can be shared.)

yes, so i'd multipled log2(bit_width) * num_rows and gone "wha-hey!
that's not very many!" and (see attached) actually it's bit_width/2 *
(num_rows * 2 - 1)

by the time you get to 128-bit that comes out to 64 x 13 = 832
control words which is a leeeetle bit much :)

8-bit however is 4 x 5 = 20 bits, which is actually manageable. also
the number of gates isn't so high: i think it's 16 gates for a 2x2
crossbar, so you'd need 20 of those. 320 gates for a full in-to-out
permutation of an 8-bit number.

> Another important point to consider is Amdahl's Law. Even an algorithm that

> does a lot of bit permutations will only spend a fraction of its total

> runtime in the bit permutation code. So even if single-butterfly-row
> instructions would be 30% faster at bit permutations it would not mean that
> the program is actually going to be 30% faster. I think in most
> applications BEXT/BDEP would provide a sufficient speedup to shift the
> performance bottleneck to other issues for most workloads.

yyeahh this is where a proper study's needed, it's why i like jeff
bush's work, he took the core algorithms (the inner loops) of 3D and,
within the constraints of keeping nyuzi a general-purpose vector/SIMD
engine he showed precisely how many instructions were needed for each
pixel, in each phase of the algorithm.

my point being: for certain types of commonly mass-produced
algorithms (3D, Video processing, crypto) what is normally discarded
because it's "not part of most workloads" turns out to be critical to
a domain-specific *required* application.

digressing [as i do], that would at worst case imply that RISC-V
becomes a "mob rule" morass of instructions: great for absolutely no
specific purpose unless backed by domain-specific non-standard
extensions or external hard macros such as
https://github.com/jbush001/ChiselGPU for the last [CPU-too-intensive]
rasterisation part of 3D Graphics, or opencores video decoder blocks
(DCT and so on).

is that such a bad thing? honestly i don't know: a case could be
made either way [based on how much effort it is to even *assess* some
of these complex domain-specific areas, let alone implement].

> I also think that BEXT/BDEP is more useful than single-butterfly-row
> instructions in a broader context beyond just general bit permutations.
> For example, just a forward and reverse butterfly is insufficient if some
> bits should be duplicated and placed in multiple positions in the output
> word while other input bits should be discarded.

yes. a butterfly network when configured with crossbars normally
does permutations only. you'd need multiple stages and a mask
system... it's too much already.

> I'd also like to make the (arguably weak :) argument that PDEP/PEXT made
> it into BMI2 while generic single-butterfly-row or generic forward-butterfly
> or reverse-butterfly instructions did not make it into BMI2. I would
> assume that Intel did consider both options and had good reasons for going
> with PDEP/PEXT instead of BFLY/IBFLY.

wary of blindly following intel or in fact any other ISA (look at
ARC's over-abundance of video-specific SIMD instructions, 2000 and
climbing!), i think this one's really simple: the O(logN*M) explosion
of control bits, plus the lack of duplication capability sadly rule it
out.

> Finally I think that single-butterfly-row instructions are a really bad
> abstraction, at least compared to the relatively straight-forward BEXT/BDEP
> instructions. Remember that those instructions would also be used to
> implement functions such as bit-field-extract.

yehyeh.

> It is relatively
> straight-forward to do that with a single BEXT instruction.

could i ask you the favour of implementing it in annotated python
code? i'd like to take a look.

> But you would
> require a bunch of single-butterfly-row instructions to do it and it would
> not be obvious to most users how to calculate the butterfly control words
> to implement the desired bit-field-extract operation.

which is why i suggested implementing several boilerplate use-cases,
and with BEXT i'd still recommend doing that. lots and lots of
examples.

>> > At least we have prior art for VPSHUFBITQMB on the RISC-V mailing list
>> > (Mar 2016), which pre-dates Intel’s public document (Oct 2017).
>>
>> publicly-published implementations of butterfly networks date back (at
>> leasst to my knowledge) to 1983: the ALICE Network (Inmos Transputers...
>> anyone remember those?), a joint project between Manchester University,
>> Plessey, and Imperial College. mentioning this because there's *no way*
>> anyone's getting a valid patent on this one.
>
> Gauss used recursive application of butterfly circuits in 1805. He was
> essentially doing Cooley–Tukey radix-2 FFT 160 years before Cooley and
> Tukey independently rediscovered it in 1965.

aaawesoome. wow.

ok, so i don't yet have a mental handle on BEXT, it's a bit... weird
knowing about butterfly networks enough to be able to follow and agree
with your assessment there, and then *completely* fail with BEXT :)

i'd like to understand how BEXT (or BGS the 16-bit variant) manages
to do a full 16-bit permutation. apologies to michael, the example
you gave showed its potential *use*-case.. not how it works, and that
can really only be done with an actual reference implementation (say
in python).

ironically, internally, the routing of the bits would actually need a
crossbar (or a butterfly network, or an OpenSMART style 2x2 routing
grid), but the actual networks there would be quite small (8-in, 8-out
or for BGS @ 16-bit probably 4-in, 4-out).

a possible enhancement to consider, if going with BGS, would be to
have a bit-pre-filter setting.

* if set to 0 the bits would be selected in a sequential fashion.
* if set to 1 the bits would be selected first by even indices 0 2 4 6
8 as the first N/2 bits followed by odd 1 3 5 7 9 as the second N/2
bits
* if set to 2 the bits would be selected first by every 4th 0 4 8 12
16 ... as the first N/4 bits, followed by every 4th+1 1 5 9 13 17 ...
for the second N/4th bits...

or there may be other schemes that can be envisioned which would give
BGS a fighting chance of having a much greater coverage of more than
16 bits at a time. i'd recommend investigating such combinations pure
and simple through brute-force search, combining BGS-variants together
relentlessly even monte-carlo style to find a scheme that minimises
the number of instructions that would be needed to do full
permutations. less thought, more brawn :)

unless someone has sufficient mathematical ability to do a set theory analysis?

l.

butterfly.py

Clifford Wolf

unread,

Apr 5, 2018, 6:30:13 AM4/5/18

to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Samuel Falvo II

Hi,

On Tue, Apr 03, 2018 at 03:14:45PM +0100, Luke Kenneth Casson Leighton wrote:
> > It is relatively
> > straight-forward to do that with a single BEXT instruction.
>
> could i ask you the favour of implementing it in annotated python
> code? i'd like to take a look.

For a bit field extract? Just pass a bitmask with the bits used in the
field as BEXT control signals. (The classical bit-field extract operation
just extracts a consecutive range of bits in-order.)

Bit field extract is probably a too simple example to show why BEXT can be
useful. You can also implement that by a simple up-shift followed by a
down-shift. That even fits in two RVC instructions, so it would still be
implementable as single fused 32-bit opcode.

I have short descriptions of the BEXT, BDEP, and GREVI instructions on this page:

https://github.com/cliffordwolf/bextdep

(Sorry, no python models. But simple reference C models for all three
instructions.)

There you will also find 21 Verilog cores that implement those functions
with differend Area/Speed/Function trade-offs. (And not all of those
implementations use butterfly circuits.)

> ok, so i don't yet have a mental handle on BEXT, it's a bit... weird
> knowing about butterfly networks enough to be able to follow and agree
> with your assessment there, and then *completely* fail with BEXT :)

BEXT/BDEP are called compress_right and expand_right on Jasper Neumanns
famous "Bit permutations" website:

http://programming.sirrida.de/bit_perm.html#c_e

And, btw, GREV is "Generalized bit reversal":

http://programming.sirrida.de/bit_perm.html#general_reverse_bits

He is certainly doing a much better job at explainig those instructions
that I am in my documents.. :)

> i'd like to understand how BEXT (or BGS the 16-bit variant) manages
> to do a full 16-bit permutation. apologies to michael, the example
> you gave showed its potential *use*-case.. not how it works, and that
> can really only be done with an actual reference implementation (say
> in python).

Even though a good way of implementing BEXT/BDEP is using butterfly
networks, BEXT/BDEP by itself is actually not very good at implementing bit
permutations or emulating butterfly circuits. The GREVI instruction is much
more helpful for that.

(But my previous evaluations have shown that BEXT/BDEP can be very helpful
for implementing permutations when used *together* with GREVI, Rotate-Shift,
Mask-And, and Or instructions.)

In a nutshell GREVI performs a butterfly where for each row either all
control bits are set, or all control bits are cleared. Therefore it only
needs log2(XLEN) control bits for the entire butterfly. (In this case it
does not matter if a forward or inverse butterfly network is used.)

For example, you can easily use GREVI to implement a single butterfly stage
in just four instructions:

grevi t0, a0, N
and t0, t0, a1
andc t1, a0, a1
or a0, t0, t1

With a0 containing the data input on entry and the result on exit. N is a
power of two indicating which butterfly stage to implement. And a1 contains
an XLEN bitmask derived from the XLEN/2 butterfly control word.

This bitmask can either be pre-computed or, if it needs to be generated
on-the-fly, it can easily be calculated with the help of BDEP. For example
for the first butterfly stage (using a0 as input and output):

li t0, 0x55555555
bdep t0, a0, t0
slli a0, t0, 1
or a0, a0, t0

Likewise for the second butterfly stage:

li t0, 0x33333333
bdep t0, a0, t0
slli a0, t0, 2
or a0, a0, t0

For the third stage:

li t0, 0x0f0f0f0f
bdep t0, a0, t0
slli a0, t0, 4
or a0, a0, t0

For the fourth stage:

li t0, 0x00ff00ff
bdep t0, a0, t0
slli a0, t0, 8
or a0, a0, t0

And for the fifth stage:

li t0, 0x0000ffff
and t0, a0, t0
slli a0, t0, 16
or a0, a0, t0

I have not tested the above code. So it probably has really stupid bugs.
But hopefully it conveys the idea.. :)

> which is why i suggested implementing several boilerplate use-cases,
> and with BEXT i'd still recommend doing that. lots and lots of
> examples.

I absolutely agree. Unfortunately the B-extension workgroup is dead at the
moment and I have very little motivation to spend any time on this if I
don't even know if there will be another B-extension workgroup in the
future and if they will even consider a BEXT/BDEP instruction.

I already spent *a lot of time* on this, so you can imagine how frustrating
it was for me when the B-extension workgroup was killed.

regards,
- clifford

--
Everybody believes in the normal approximation, the experimenters because they
think it is a mathematical theorem, the mathematicians because they think it is
an experimental fact.
-- G. Lippman, French Physicist, 1845-1921

Luke Kenneth Casson Leighton

unread,

Apr 5, 2018, 7:47:39 AM4/5/18

to Clifford Wolf, RISC-V ISA Dev, Samuel Falvo II

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Thu, Apr 5, 2018 at 11:30 AM, Clifford Wolf <clif...@clifford.at> wrote:
> Hi,
>
> On Tue, Apr 03, 2018 at 03:14:45PM +0100, Luke Kenneth Casson Leighton wrote:
>> > It is relatively
>> > straight-forward to do that with a single BEXT instruction.
>>
>> could i ask you the favour of implementing it in annotated python
>> code? i'd like to take a look.
>
> For a bit field extract? Just pass a bitmask with the bits used in the
> field as BEXT control signals. (The classical bit-field extract operation
> just extracts a consecutive range of bits in-order.)
>
> Bit field extract is probably a too simple example to show why BEXT can be
> useful. You can also implement that by a simple up-shift followed by a
> down-shift. That even fits in two RVC instructions, so it would still be
> implementable as single fused 32-bit opcode.
>
> I have short descriptions of the BEXT, BDEP, and GREVI instructions on this page:
>
> https://github.com/cliffordwolf/bextdep
>
> (Sorry, no python models. But simple reference C models for all three
> instructions.)

oh! right! yes great, that's perfect. ccc goooood :)

hmm i might have been confusing it with BGS, which i definitely don't grok.

> BEXT/BDEP are called compress_right and expand_right on Jasper Neumanns
> famous "Bit permutations" website:
>
> http://programming.sirrida.de/bit_perm.html#c_e
>
> And, btw, GREV is "Generalized bit reversal":
>
> http://programming.sirrida.de/bit_perm.html#general_reverse_bits
>
> He is certainly doing a much better job at explainig those instructions
> that I am in my documents.. :)

:) appreciated clifford. i'll look on that site and see if i can
find an implementation of BGS as well.

> In a nutshell GREVI performs a butterfly where for each row either all
> control bits are set, or all control bits are cleared. Therefore it only
> needs log2(XLEN) control bits for the entire butterfly. (In this case it
> does not matter if a forward or inverse butterfly network is used.)
>
> For example, you can easily use GREVI to implement a single butterfly stage
> in just four instructions:
>
> grevi t0, a0, N
> and t0, t0, a1
> andc t1, a0, a1
> or a0, t0, t1

nice.

> [...]

> I have not tested the above code. So it probably has really stupid bugs.
> But hopefully it conveys the idea.. :)

yes, very well.

>> which is why i suggested implementing several boilerplate use-cases,
>> and with BEXT i'd still recommend doing that. lots and lots of
>> examples.
>
> I absolutely agree. Unfortunately the B-extension workgroup is dead at the
> moment and I have very little motivation to spend any time on this if I
> don't even know if there will be another B-extension workgroup in the
> future and if they will even consider a BEXT/BDEP instruction.
>
> I already spent *a lot of time* on this, so you can imagine how frustrating
> it was for me when the B-extension workgroup was killed.

o um... okaay. so it's like _that_ is it? i had absolutely no idea.
that's... not good. as in, it calls into question the future
viability of RISC-V as a whole, not just if peoples' time is wasted
but if the RISC-V Foundation is so... i don't even know what adjective
to use here, but i'm really quite shocked and alarmed to hear that.

as in: what's to say that the RISC-V Foundation won't shut down
*other* extensions?? wasting *even more* people's time?

why on earth didn't the RISC-V Foundation just say something like,
"ok, sorry folks, we're running out of people and resources at the
moment to deal with B-extension, volunteers needed, have at it, come
up with something and we'll re-add people and resources when you have
something concrete?"

that would have been *infinitely* better.

l.

Clifford Wolf

unread,

Apr 5, 2018, 12:14:08 PM4/5/18

to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Samuel Falvo II

On Thu, Apr 05, 2018 at 12:47:15PM +0100, Luke Kenneth Casson Leighton wrote:
> >> which is why i suggested implementing several boilerplate use-cases,
> >> and with BEXT i'd still recommend doing that. lots and lots of
> >> examples.
> >
> > I absolutely agree. Unfortunately the B-extension workgroup is dead at the
> > moment and I have very little motivation to spend any time on this if I
> > don't even know if there will be another B-extension workgroup in the
> > future and if they will even consider a BEXT/BDEP instruction.
> >
> > I already spent *a lot of time* on this, so you can imagine how frustrating
> > it was for me when the B-extension workgroup was killed.
>
> o um... okaay. so it's like _that_ is it? i had absolutely no idea.
> that's... not good. as in, it calls into question the future
> viability of RISC-V as a whole, not just if peoples' time is wasted
> but if the RISC-V Foundation is so... i don't even know what adjective
> to use here, but i'm really quite shocked and alarmed to hear that.
>
> as in: what's to say that the RISC-V Foundation won't shut down
> *other* extensions?? wasting *even more* people's time?
>
> why on earth didn't the RISC-V Foundation just say something like,
> "ok, sorry folks, we're running out of people and resources at the
> moment to deal with B-extension, volunteers needed, have at it, come
> up with something and we'll re-add people and resources when you have
> something concrete?"

It was not like that at all. You need a RISC-V foundation gold member to
chair a working group within RISC-V and the employer of the person chairing
the B-extension working group backed out of the RISC-V Foundation. So the
working group was killed.

What really bothered me about it was the "official" BS reason given for
killing the work group, which was that the group under-performed. Which is
especially ridiculous because this was one day after the first draft
proposal was completed. Initially they even completely locked the group on
the risc-v groupware thingy so all the emails and all the documents in the
groupware thingy became unavailable to everyone who had been working on it.

The problem isn't lack of volunteers. I would have volunteered to keep the
task group alive (at least until a new proper chair would have been found),
but I do not have the $10,000 per year to buy a gold membership.

Jacob Bachmeyer

unread,

Apr 7, 2018, 12:03:59 AM4/7/18

to Clifford Wolf, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Samuel Falvo II

Could the discussions continue here on the isa-dev list?

-- Jacob

Clifford Wolf

unread,

Apr 7, 2018, 11:22:29 AM4/7/18

to Jacob Bachmeyer, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Samuel Falvo II

Hi,

On Fri, Apr 06, 2018 at 11:03:57PM -0500, Jacob Bachmeyer wrote:
>> The problem isn't lack of volunteers. I would have volunteered to keep the
>> task group alive (at least until a new proper chair would have been found),
>> but I do not have the $10,000 per year to buy a gold membership.
>
> Could the discussions continue here on the isa-dev list?

Sure. But while the spec text itself might not be perfect, the set of
instructions is pretty much at a point where the path forward is to
actually implement them in at least one compiler and some cores and
continue with doing proper quantative analysis using actual hardware
designs to estimate the hw cost and actual programs to measure the gains.

But it is *much* easier to convince people to contribute to that when you
have a mandate by being an official task group. So I'm afraid it will be
hard to actually convince compiler and hw devs to participate.

I'd also expect "continued discussions" here on the isa-dev list would only
be repetitions of the discussions already had and to be honest I don't know
how much time and energy I have for repeating past discussions and
reiterating past arguments without clear outlook of this going anywhere.

But what the hell! I've spend the last few hours cleaning up the draft spec
a bit and converting it from markdown to TeX and it is now up on github:

https://github.com/cliffordwolf/xbitmanip

(see xbitmanip-33.pdf for a pre-compiled version of it)

regards,
- clifford

Luke Kenneth Casson Leighton

unread,

Apr 7, 2018, 11:31:13 AM4/7/18

to Clifford Wolf, Jacob Bachmeyer, RISC-V ISA Dev, Samuel Falvo II

On Sat, Apr 7, 2018 at 4:22 PM, Clifford Wolf <clif...@clifford.at> wrote:

> But what the hell! I've spend the last few hours cleaning up the draft spec
> a bit and converting it from markdown to TeX and it is now up on github:
>
> https://github.com/cliffordwolf/xbitmanip

niiiice. c-code functional definitions. I'd at the very least like
to a side-by-side comparison with P-Ext.

l.

lk...@lkcl.net

unread,

Apr 7, 2018, 4:29:22 PM4/7/18

to RISC-V ISA Dev, jcb6...@gmail.com, lk...@lkcl.net, sam....@gmail.com

On Saturday, April 7, 2018 at 4:22:29 PM UTC+1, clifford wrote:

But it is *much* easier to convince people to contribute to that when you
have a mandate by being an official task group.

I *might* be able to help there if I have the right information about how to go about doing that. Would a Founding-Silver Member (a University) be able to reactivate the WG?

l.

Jacob Bachmeyer

unread,

Apr 7, 2018, 8:44:14 PM4/7/18

to Clifford Wolf, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Samuel Falvo II

Clifford Wolf wrote:
> Hi,
>
> On Fri, Apr 06, 2018 at 11:03:57PM -0500, Jacob Bachmeyer wrote:
>
>>> The problem isn't lack of volunteers. I would have volunteered to keep the
>>> task group alive (at least until a new proper chair would have been found),
>>> but I do not have the $10,000 per year to buy a gold membership.
>>>
>> Could the discussions continue here on the isa-dev list?
>>
>
> Sure. But while the spec text itself might not be perfect, the set of
> instructions is pretty much at a point where the path forward is to
> actually implement them in at least one compiler and some cores and
> continue with doing proper quantative analysis using actual hardware
> designs to estimate the hw cost and actual programs to measure the gains.
>

With the spec out in the open, adding support to Rocket and GCC and
evaluating the result could be a grad student project. :-)

> But it is *much* easier to convince people to contribute to that when you
> have a mandate by being an official task group. So I'm afraid it will be
> hard to actually convince compiler and hw devs to participate.
>
> I'd also expect "continued discussions" here on the isa-dev list would only
> be repetitions of the discussions already had and to be honest I don't know
> how much time and energy I have for repeating past discussions and
> reiterating past arguments without clear outlook of this going anywhere.
>
> But what the hell! I've spend the last few hours cleaning up the draft spec
> a bit and converting it from markdown to TeX and it is now up on github:
>
> https://github.com/cliffordwolf/xbitmanip
>
> (see xbitmanip-33.pdf for a pre-compiled version of it)

I have previously proposed encodings (message-id
<59FA5380...@gmail.com>
<URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/59FA5380.4000504%40gmail.com>
and message-id <59FBA412...@gmail.com>
<URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/59FBA412.1010307%40gmail.com>)
for 8/11 instructions proposed: GREVI, rotates, and shift-ones (which I
was calling "mask shift" at the time). Are those descriptions
sufficient or would you prefer a patch or pull request?

-- Jacob

Clifford Wolf

unread,

Apr 8, 2018, 12:34:10 PM4/8/18

to Jacob Bachmeyer, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Samuel Falvo II

Hi,

On Sat, Apr 07, 2018 at 07:44:10PM -0500, Jacob Bachmeyer wrote:
> I have previously proposed encodings (message-id
> <59FA5380...@gmail.com> <URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/59FA5380.4000504%40gmail.com>
> and message-id <59FBA412...@gmail.com> <URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/59FBA412.1010307%40gmail.com>)
> for 8/11 instructions proposed: GREVI, rotates, and shift-ones
> (which I was calling "mask shift" at the time). Are those
> descriptions sufficient or would you prefer a patch or pull request?

Would you care to describe in what way your proposed encoding is in your
opinion better than the encodings described in chapter 2 of the draft spec?
(Specifically see Table 2.1 on page 8 of the current draft spec.)

Also: I think that (1) concrete instruction encodings are a red herring at
this stage and (2) there is a separate Opcode Space Management Task Group
within the RISC-V foundation and they would have the last word on concrete
encodings anyways.

The big questions right now are:
- What is the hardware cost of implementing the ALU portion of the proposed
instructions? (Instruction decoder overhead is neglectable in comparison
for any reasonable instruction encoding.)
- How well can compilers utilize the proposed instructions?
- What is the performance and size gain on real-world code?

To answer those questions we'd use a temporary NSE encoding anyways (see
chapter 5 of current draft spec). After we are done evaluating the proposed
instructions and made a final decision on what instructions to include it
makes sense to start discussing and fine-tuning the concrete final
brownfield encoding.

regards,
- clifford

Luke Kenneth Casson Leighton

unread,

Apr 8, 2018, 2:49:36 PM4/8/18

to Clifford Wolf, chuanhua.chang, Jacob Bachmeyer, RISC-V ISA Dev, Samuel Falvo II

On Sun, Apr 8, 2018 at 5:34 PM, Clifford Wolf <clif...@clifford.at> wrote:

> The big questions right now are:
> - What is the hardware cost of implementing the ALU portion of the proposed
> instructions? (Instruction decoder overhead is neglectable in comparison
> for any reasonable instruction encoding.)
> - How well can compilers utilize the proposed instructions?
> - What is the performance and size gain on real-world code?

Can I add, "how much of P can be removed as being duplicated in B?"
to the list?

l.

Jacob Bachmeyer

unread,

Apr 9, 2018, 1:03:28 AM4/9/18

to Clifford Wolf, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Samuel Falvo II, Michael Clark

Clifford Wolf wrote:
> Hi,
>
> On Sat, Apr 07, 2018 at 07:44:10PM -0500, Jacob Bachmeyer wrote:
>
>> I have previously proposed encodings (message-id
>> <59FA5380...@gmail.com> <URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/59FA5380.4000504%40gmail.com>
>> and message-id <59FBA412...@gmail.com> <URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/59FBA412.1010307%40gmail.com>)
>> for 8/11 instructions proposed: GREVI, rotates, and shift-ones
>> (which I was calling "mask shift" at the time). Are those
>> descriptions sufficient or would you prefer a patch or pull request?
>>
>
> Would you care to describe in what way your proposed encoding is in your
> opinion better than the encodings described in chapter 2 of the draft spec?
> (Specifically see Table 2.1 on page 8 of the current draft spec.)
>

The only difference is that I believe that bit 29 should be used instead
and bit 31 left zero, due to the high fanout associated with bit 31 from
its sign-extension to form immediate values in RISC-V. (I presume that
this is the reason that bit 31 was not used to select arithmetic
shifts.) Other than that small difference, it seems to mostly *be* the
encoding that I previously proposed on isa-dev. (This was not in the
draft yet when I sent that message.)

My proposed encoding, in groups by decreasing level of confidence in the
"fit":

Near-certain fit: (our only disagreement is whether bit 31 or bit 29
should be used to select shift-ones and rotate)
GREV (R-type) (fitting as "left arithmetic shift")
{opcode, funct3, funct7} = {$OP, $SLL, 7'b0100000}
GREVI (I-type)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {7'b0100000, k}} (RV32)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {6'b010000, k}} (RV64)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {5'b01000, k}} (RV128)
SLO (R-type)
{opcode, funct3, funct7} = {$OP, $SLL, 7'b0010000}
SRO (R-type)
{opcode, funct3, funct7} = {$OP, $SRL, 7'b0010000}
SLOI (I-type)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {7'b0010000, shamt}} (RV32)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {6'b001000, shamt}} (RV64)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {5'b00100, shamt}} (RV128)
SROI (I-type)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {7'b0010000, shamt}} (RV32)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {6'b001000, shamt}} (RV64)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {5'b00100, shamt}} (RV128)
ROL (R-type) (rotates fit as "arithmetic shift ones")
{opcode, funct3, funct7} = {$OP, $SLL, 7'b0110000}
ROR (R-type)
{opcode, funct3, funct7} = {$OP, $SRL, 7'b0110000}
ROLI (I-type) (hardware opcode reserved for symmetry)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {7'b0110000, shamt}} (RV32)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {6'b011000, shamt}} (RV64)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {5'b01100, shamt}} (RV128)
RORI (I-type)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {7'b0110000, shamt}} (RV32)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {6'b011000, shamt}} (RV64)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {5'b01100, shamt}} (RV128)

Slightly-less confident, but opcode and funct3 near-certain:
ANDC (R-type)
{opcode, funct3, funct7} = {$OP, $AND, 7'b0100000}

Placed here by considering funct3 $SLT and $SLTU as "special function"
codes:
PCNT (R-type)
{opcode, funct3, funct7, rs2} = {$OP, $SLT, 7'b0100000, 5'b00000}
BEXT (R-type)
{opcode, funct3, funct7} = {$OP, $SLT, 7'b0010000}
BDEP (R-type)
{opcode, funct3, funct7} = {$OP, $SLT, 7'b0110000}
CLZ (R-type)
{opcode, funct3, funct7, rs2} = {$OP, $SLT, 7'b0001000, 5'b00000}
CTZ (R-type) (hardware opcode tentatively reserved)
{opcode, funct3, funct7, rs2} = {$OP, $SLT, 7'b0101000, 5'b00000}
The funct7 codes in this group are bit-reflected counting in binary,
starting at bit 30.

Additionally, ANDCI, PCNTI, CLZI, CTZI, are all possible as assembler
pseudo-instructions. ANDCI simply assembles to ANDI with the immediate
complemented. PCNTI, CLZI, and CTZI can all be evaluated in the
assembler and assembled as LI with the result.

BEXT and BDEP require XLEN-bit control words and therefore cannot have
an immediate form.

The *W variants on RV64 are encoded almost identically, but with the
opcode field changed to $OP-32 or $OP-IMM-32 as appropriate. The same
pattern provides the *D variants for RV128.

Note that *W and *D variants of some of these instructions could be
assembler pseudo-instructions and macro-op fused at runtime.

= =

> Also: I think that (1) concrete instruction encodings are a red herring at
> this stage and (2) there is a separate Opcode Space Management Task Group
> within the RISC-V foundation and they would have the last word on concrete
> encodings anyways.
>

I firmly believe that any proposed instruction should include a proposed
concrete encoding, if for no other reason than to prove that the
instruction *can* be encoded in RISC-V's constraints. (The proposed bit
manipulation instructions can be so encoded, but I seem to recall having
seen instructions proposed on isa-dev that could not be encoded.)

I suppose that the use of temporary encodings for compiler usage
evaluations is less of a concern, although at this stage, assembler
pseudo-instructions are likely more important: if we find that
compilers generate CTZ far more often than CLZ, for example, we might
want to make CTZ a hardware opcode. If nothing else, we could define
all of the proposed bit manipulation instructions as assembler
pseudo-instructions that expand into small inline subroutines, thus
making pseudo-RVXbitmanip.

Putting the temporary encoding in the guaranteed non-standard encoding
space is being excessively deferential to bureaucrats, especially when
the RISC-V spec (Chapter 21 "Extending RISC-V") specifically accepts
that non-standard extensions "may conflict with other standard or
non-standard extensions". If our final proposed encoding is somehow
horribly wrong or conflicts with other standard extensions adopted in
the meantime, we can work with Opcode Space Management to fix it while
RVXbitmanip is being promoted to RVB. Until then, I argue that we
should avoid creating a branch of the tools that we *know* will be
long-term incompatible.

> The big questions right now are:
> - What is the hardware cost of implementing the ALU portion of the proposed
> instructions? (Instruction decoder overhead is neglectable in comparison
> for any reasonable instruction encoding.)
> - How well can compilers utilize the proposed instructions?
> - What is the performance and size gain on real-world code?
>
> To answer those questions we'd use a temporary NSE encoding anyways (see
> chapter 5 of current draft spec). After we are done evaluating the proposed
> instructions and made a final decision on what instructions to include it
> makes sense to start discussing and fine-tuning the concrete final
> brownfield encoding.

Aside from the shift-ones instructions, which have trivial ALU cost, I
suspect that ALU costs will strongly depend on the ALU topology used.
For example, ANDC could require an additional bank of XOR gates, or
simply require slightly more complex control logic and use the ALU's
main XOR function to invert rs2. Rotates likewise could range from
duplicating the input and funnel shifting to actually moving bits
one-at-a-time. There may be some kind of "folded shifter" topology that
makes duplicating the input and funnel shifting much cheaper by avoiding
the need for an actual funnel shifter.

GREV seems to require its own permutation network, but some subsets are
easily added to some shifters, as Allen Baum once mentioned (but I
cannot find that message at the moment). Can BEXT/BDEP use the same
permutation network as GREV?

PCNT, CLZ, CTZ are all reduction operations and should be able to share
most of their logic.

Adding Michael to the CC list on this thread; he has been one of our
software experts on this and perhaps he will be interested. He knows
more about compilers than I do. To Michael: we are carrying on bit
manipulation development using
<URL:https://github.com/cliffordwolf/xbitmanip> after the BitManip WG
was shut down.

-- Jacob

Rogier Brussee

unread,

Apr 9, 2018, 10:59:33 AM4/9/18

to RISC-V ISA Dev, clif...@clifford.at, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jcb6...@gmail.com, jhause...@gmail.com

Op maandag 9 april 2018 07:03:28 UTC+2 schreef Jacob Bachmeyer:

k is just a shift amount (right?), i.e. GREVI basically shares the encoding of the shifts?

SLO (R-type)
{opcode, funct3, funct7} = {$OP, $SLL, 7'b0010000}
SRO (R-type)
{opcode, funct3, funct7} = {$OP, $SRL, 7'b0010000}
SLOI (I-type)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {7'b0010000, shamt}} (RV32)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {6'b001000, shamt}} (RV64)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {5'b00100, shamt}} (RV128)
SROI (I-type)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {7'b0010000, shamt}} (RV32)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {6'b001000, shamt}} (RV64)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {5'b00100, shamt}} (RV128)
ROL (R-type) (rotates fit as "arithmetic shift ones")
{opcode, funct3, funct7} = {$OP, $SLL, 7'b0110000}
ROR (R-type)
{opcode, funct3, funct7} = {$OP, $SRL, 7'b0110000}

ROR is just ROL with a negative value for rs2 (since rs2 is taken mod log2(XLEN)) so I think only one of ROL and ROR it is needed.

ROLI (I-type) (hardware opcode reserved for symmetry)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {7'b0110000, shamt}} (RV32)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {6'b011000, shamt}} (RV64)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {5'b01100, shamt}} (RV128)
RORI (I-type)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {7'b0110000, shamt}} (RV32)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {6'b011000, shamt}} (RV64)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {5'b01100, shamt}} (RV128)

RORI is just ROLI with a negative immediate so I think only one of ROL and ROR it is needed.

Slightly-less confident, but opcode and funct3 near-certain:
ANDC (R-type)
{opcode, funct3, funct7} = {$OP, $AND, 7'b0100000}

Placed here by considering funct3 $SLT and $SLTU as "special function"
codes:

SLT and SLTU seem the right opcodes to try to reserve for instructions "close" to comparison. PULP for example, has min[u] and max[u]. or for bit manipulation, one could think of a variant of slt and sltu that

returns -1 (i.e. all bits 1) instead of 1:

negslt rd rs1 rs2 --> slt rd rs1 rs2; sub rd zero rd

negsltu rd rs1 rs2 --> sltu rd rs1 rs2; sub rd zero rd

negsli rd rs1 imm --> slt rd rs1 imm; sub rd zero rd

negslui rd rs1imm --> sltui rd rs1 imm; sub rd zero rd

PCNT (R-type)
{opcode, funct3, funct7, rs2} = {$OP, $SLT, 7'b0100000, 5'b00000}

PCNT, CLZ and CTZ would be the first true 2 register instructions. This is probably discussed somewhere,

but for regularity of the instruction decoding logic it would seem useful to make them in a 3 instructions by composing with a more or less

trivial binary operation that ideally adds some value even if the main use of the instruction would be with one register set to x0 == zero.

For PCNT two possibilities come to mind:

bcnt rd rs1 rs2 pop_count(rs1 &~ rs2) #bcnt is more general than pcnt

bcnt (R-type)

{opcode, funct3, funct7} = {$OP, $AND, 7'b0101'000}

pseudo's

pcnt rd rs1 bcnt rd rs1 zero

zcnt rd rs1 li rd -1; bcnt rd rd rs1

ctz rd rs1 sub rd zero rs1; bcnt rd rd rs1 #note this does ctz: 0 --> 0

hamm rd rs1 rs2 popcount(rs1 ^ rs2) #hamm stands for Hamming distance

hamm (R-type)

{opcode, funct3, funct7} = {$OP, $XOR, 7'b0100'000}

pseudo's

pcnt rd rs1 hamm rd rs1 zero

zcnt rd rs1 li rd -1; hamm rd rd rs1

ffs rd rs1 addi rd rs1 -1; hamm rd rd rs1 #note this does ffs : 0 --> XLEN but also ffs : 1<<(XLEN -1) --> XLEN

BEXT (R-type)
{opcode, funct3, funct7} = {$OP, $SLT, 7'b0010000}
BDEP (R-type)
{opcode, funct3, funct7} = {$OP, $SLT, 7'b0110000}
CLZ (R-type)
{opcode, funct3, funct7, rs2} = {$OP, $SLT, 7'b0001000, 5'b00000}

Again one could make this a proper 3 register instruction:

clz rd rs1 rs2 count_leading_zeros(rs1 & ~rs)

clz (R-type)

{opcode, funct3, funct7} = {$OP, $AND, 7'b0110'000}

CLZ main use is softfloat, I think, and the additional & would seem to be useful for normalising denormalised numbers that are packed in exponent and mantissa

John Hauser maintains the UCBAR Softfloat lib and should know much better than I do.

likewise

ctz (R-type)

{opcode, funct3, funct7} = {$OP, $AND, 7'b0111'000}

CTZ (R-type) (hardware opcode tentatively reserved)
{opcode, funct3, funct7, rs2} = {$OP, $SLT, 7'b0101000, 5'b00000}
The funct7 codes in this group are bit-reflected counting in binary,
starting at bit 30.

Additionally, ANDCI, PCNTI, CLZI, CTZI, are all possible as assembler
pseudo-instructions. ANDCI simply assembles to ANDI with the immediate
complemented. PCNTI, CLZI, and CTZI can all be evaluated in the
assembler and assembled as LI with the result.

BEXT and BDEP require XLEN-bit control words and therefore cannot have
an immediate form.

The *W variants on RV64 are encoded almost identically, but with the
opcode field changed to $OP-32 or $OP-IMM-32 as appropriate. The same
pattern provides the *D variants for RV128.

Note that *W and *D variants of some of these instructions could be
assembler pseudo-instructions and macro-op fused at runtime.

= =

> Also: I think that (1) concrete instruction encodings are a red herring at
> this stage and (2) there is a separate Opcode Space Management Task Group
> within the RISC-V foundation and they would have the last word on concrete
> encodings anyways.
>

I firmly believe that any proposed instruction should include a proposed
concrete encoding, if for no other reason than to prove that the
instruction *can* be encoded in RISC-V's constraints.

+1.

That and the cost of the encodings. In particular, OP-IMM is a non renewable resource.

I can imagine implementations really just wanting bswap and pcnt. How much additional cost does the whole B extension bring?

Paul A. Clayton

unread,

Apr 9, 2018, 1:51:41 PM4/9/18

to jcb6...@gmail.com, RISC-V ISA Dev

On 4/9/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
[snip]

> Aside from the shift-ones instructions, which have trivial ALU cost, I
> suspect that ALU costs will strongly depend on the ALU topology used.
> For example, ANDC could require an additional bank of XOR gates, or
> simply require slightly more complex control logic and use the ALU's
> main XOR function to invert rs2.

I would have thought one would use the operand inverter used for
subtraction (but I am not a hardware developer). If so, one would
want to have the complemented operand to use the same operand
field as the subtrahend in subtraction. (I did not look to see if that
was the case for your proposed encoding.)

Rogier Brussee

unread,

Apr 9, 2018, 3:32:11 PM4/9/18

to RISC-V ISA Dev, jcb6...@gmail.com, lk...@lkcl.net, sam....@gmail.com

Thanks for making the draft B-proposal available and accessible!

Here are some (hopefully constructive) comments:

* I think that a ROL and ROLI are redundant if you have ROR and RORI (or vice versa) because one is just the other with negative rs2/immediate.

* pcnt, clz and ctz are the first 2 register intructions introduced. They can naturally be changed to three register R-type instructions for more regularity

bcnt rd rs1 rs2 : pop_count (rs1& ~rs2)

pseudo's

pcnt rd rs1 --> bcnt rs1 zero #counts 1's

zcnt rd rs1 --> li rd -1; bcnt rd rd rs1 #counts 0's

ctz rd rs1 --> addi rd rs1 -1; bcnt rd rs1 rd #counts trailing zero's. (note ~(x-1) = -(x-1) -1 = -x) . This pseudo ctz does: 0 --> 0

bclz rd rs1 rs2: count_leading_zeros(rs1& ~rs2)

pseudo's

clz rd rs1 : bclz rd rs1 zero #count leading zeros

clo rd rs1 : li rd -1; bclz rd rd rs1 #count leading 1's

See also more detailed inline comments in the reaction to Jacob Bachmeier.

As a general remark: I think it would be instructive to have rationale for instructions in terms of the kind of problems where they find use.

Here is a starter basically an update of https://groups.google.com/a/groups.riscv.org/forum/#!searchin/isa-dev/Rogier$20Brussee$20Bit$20manipulation/isa-dev/HcoNWhMJVeI/_KiCTpcoAAAJ:

Opcode application(s) Alternatives

ROL[W] rd,rs1,rs2 Cryptography, hash, fusion

ROR[W] rd,rs1,rs2 same ADD rd, zero rs2; ROL[w] rd, rs1, rd

ROLI[W] rd,rs1,shamt same
RORI[W] rd,rs1,shamt same ROLI rd, rs1, -shamt

CLZ[W] rd,rs1 softfloat, ilog2 BCLZ rd rs1 zero

BCLZ[W] rd rs1 rs2 softfloat, ilog2

PCNT[W] rd,rs1 malloc, bitmaps bcnt rd rs1 zero

CTZ[W] rd,rs1 malloc, bitmaps addi rd rs1 -1; bcnt rd rs1 rd

BCNT rd, rs, rs2 malloc, bitmaps

BREV[W] rd,rs1 Fast Fourier Transform, bitmaps?? GREVI rd, -1

BSWAP[W] rd,rs1 be <-> le e.g. networking, GREVI rd, - 8 (?)

GREVI[W] rd rs1 shamt see BSWAP, BREV bitpermutations BSWAP, BREV instructions for common case, trapping for others

BEXT[W] rd,rs1,rs2 instruction decoding in traps, bitmaps masking and shifting, instructions or csr's that directly "parse" the trapping instruction into (instr, rd, rs1, rs2, rs3, imm/shamt)??????.
BDEP[W] rd,rs1,rs2 bitmaps masking and shifting.

ANDC rd rs1 rs2 clear bit, fusion XORI rd, rs2, -1; AND rd, rd, rs1, (possibly extending C extension with not instruction)

SLO rd rs1 rs2 constructing masks, fusion XORI rd, rs1, -1; SLL rd, rd, rs2; XORI rd, rd, -1

SLOI rd rs1 shamt same XORI rd, rs1, -1; SLLI rd, rd, shamt; XORI rd, rd, -1

SRO rd rs1 rs2 same XORI rd, rs1, -1; SRL rd, rd, shamt; XORI rd, rd, -1

SROI rd rs1 shamt same XORI rd rs1, -1; SRLI rd rd shamt; XORI rd rd -1

Op zaterdag 7 april 2018 17:22:29 UTC+2 schreef clifford:

Hi,

On Fri, Apr 06, 2018 at 11:03:57PM -0500, Jacob Bachmeyer wrote:
>> The problem isn't lack of volunteers. I would have volunteered to keep the
>> task group alive (at least until a new proper chair would have been found),
>> but I do not have the $10,000 per year to buy a gold membership.
>
> Could the discussions continue here on the isa-dev list?

Sure. But while the spec text itself might not be perfect, the set of
instructions is pretty much at a point where the path forward is to
actually implement them in at least one compiler and some cores and
continue with doing proper quantative analysis using actual hardware
designs to estimate the hw cost and actual programs to measure the gains.

But it is *much* easier to convince people to contribute to that when you
have a mandate by being an official task group. So I'm afraid it will be
hard to actually convince compiler and hw devs to participate.

I'd also expect "continued discussions" here on the isa-dev list would only
be repetitions of the discussions already had and to be honest I don't know
how much time and energy I have for repeating past discussions and
reiterating past arguments without clear outlook of this going anywhere.

Sorry in advance if I restart discussions.

Luke Kenneth Casson Leighton

unread,

Apr 9, 2018, 4:12:53 PM4/9/18

to Rogier Brussee, RISC-V ISA Dev, Jacob Bachmeyer, Samuel Falvo II

On Mon, Apr 9, 2018 at 8:32 PM, Rogier Brussee <rogier....@gmail.com> wrote:

>> I'd also expect "continued discussions" here on the isa-dev list would
>> only
>> be repetitions of the discussions already had and to be honest I don't
>> know
>> how much time and energy I have for repeating past discussions and
>> reiterating past arguments without clear outlook of this going anywhere.
>>
>
> Sorry in advance if I restart discussions.

i have long-term experience in the kinds of working-practices that
mitigate such issues: they do however require some discipline on the
part of *at least* one person and preferably all participants, as
otherwise it gets extremely tedious as, particularly in a
collaborative group where each contributor is self-sovereign [not a
paid-employee basically] the "one" person feels like they're being
taken advantage of if they're the only one doing the
information-maintenance. unfortunately "back-dating" such practices
also requires... patience shall we say.

the method basically involves utilising an online collaborative wiki,
at the bare minimum quite literally throwing in (cut/pasting)
fragments of conversation for later sorting, but at the very least
simply maintaining a series of links under section sub-headings with
even just a couple of brief words for each. "Discussion of GREVI:
{insert link}" or "Assessment of use-cases for proposed instructions:
{insert link}" or "Clifford's draft proposal: {insert link}" and so
on.

the wiki page (or pages if if gets particularly comprehensive) thus
becomes a "central reference" for free-form collation of critical
information that is otherwise - as any forum or mailing list user
quickly finds - flat-out impossible to find.

the wiki page basically allows people to short-circuit unnecessary
discussions by acting as a place where, when newcomers start asking
questions, people can go hah, i remember that discussion, let me just
to to the page, cut/paste the link, and reply "This Has Been Discussed
Already Please Read It First Thank You"

the long and short of it is: you are free and at liberty to choose
one of the following:

(1) use a forum or mailing list as a "document storage and retrieval
system" and be driven utterly insane and up the wall by the endless
repetition
(2) use a wiki to collate information and be slightly irritated by
what *appears* to be "time-wasting"
(3) give up

i can say more if people are interested, as well as either set up a
wiki or recommend one that is already up.

l.

Rogier Brussee

unread,

Apr 9, 2018, 4:23:36 PM4/9/18

to RISC-V ISA Dev, clif...@clifford.at, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jcb6...@gmail.com, jhause...@gmail.com

Op maandag 9 april 2018 16:59:33 UTC+2 schreef Rogier Brussee:

bcnt rd rs1 rs2 pop_count(rs1 &~ rs2) #bcnt is more general than pcnt

bcnt (R-type)
{opcode, funct3, funct7} = {$OP, $AND, 7'b0101'000}

pseudo's

pcnt rd rs1 bcnt rd rs1 zero
zcnt rd rs1 li rd -1; bcnt rd rd rs1
ctz rd rs1 sub rd zero rs1; bcnt rd rd rs1 #note this does ctz: 0 --> 0

Oops

count_trailing_zero(x) == pop_count(x & (-x))

since ~(x-1) == -(x - 1) -1 = -x

that should be

ctz rd rs1 : addi rd rd -1; bcnt rd rs1 rd

Michael Clark

unread,

Apr 9, 2018, 5:38:02 PM4/9/18

to Jacob Bachmeyer, Clifford Wolf, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Samuel Falvo II

I’ll have a look when I get a chance.

One point that I would like to make from an analysis perspective, is that we should include width variants where appropriate. It would be unfortunate to constrain the analysis to XLEN width operations only, and indeed it would not be a complete quantitive analysis if the cost in instructions (and implementations) was not compared for XLEN width-only vs having both 32-bit and 64-bit width variants on RV64. One has to compare the dynamic retired instruction count with and without 32-bit variants on 64-bit, to make an assessment of the value of encoding W variants as instructions vs pseudos. There is also the macro-op fusion argument, but note that this places a burden on simple in-order RV64 systems that don’t implement macro-op fusion. These are perhaps the systems that can benefit the most from fewer dynamic retired instructions, assuming from quantitive analysis, that it is seen that there is benefit from the 32-bit variant.

Now, regarding quantitive analysis. It would be disingenuous to use SPECint to measure benefits for a small embedded “router”. One must have application profiles that one can measure the cost and benefits. i.e. a network and crypto profile. Note many small systems implement software network and crypto algorithms and many of the algorithms have been optimised for software implementations. e.g. salsa20 chacha20-poly1305. While chacha20-poly1305 doesn’t require byte swapping on little-endian, there are however many software versions of algorithms in plain C or optimized assembly that use BSWAP32 (GREVIW 24) and ROR32 (RORIW, ROLIW) on 64-bit systems. I suggest we use SuperCop and “openssl speed” and nginx or apache TLS as part of the small network and embedded router profile.

For example, a 64-bit rotate is not very useful for SHA-256 (as one example) on RV64. It essentially makes rotate useless except where it is equal to XLEN. SHA-512 uses 64-bit rotate so it would benefit, but SHA-256 would not. If one spends some time going through the large set of commonly used crypto ciphers, digest algorithms and what not, one would find that a 64-bit system will frequently invoke rotate and bswap primitives on 32-bit quantities as many crypto algorithms are designed to use 32-bit “big-endian” values so that they work on heterogenous 32-bit and 64-bit systems, where previously big-endian with 32-bit values was the standard for recent generation crypto algorithms.

It’s only very recent crypto algorithms that have started recognising that “little-endian” has won. e.g. Curve25519 and Ed25519 use little-endian encodings for curve points in contrast to standard elliptic curve algorithms which use “big-endian” curve points.

It would be nice to get the full suite implemented in one of the compilers (e.g. GCC), so that it is possible to do analysis before and after pruning, versus eagerly pruning without any quantitive analysis. If we have width variant encodings, when I could perhaps help with binutils and gcc patches when I get time (but probably not this month).

Jacob Bachmeyer

unread,

Apr 10, 2018, 12:28:56 AM4/10/18

to Rogier Brussee, RISC-V ISA Dev, clif...@clifford.at, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

Rogier Brussee wrote:
>
>
> Op maandag 9 april 2018 07:03:28 UTC+2 schreef Jacob Bachmeyer:
>
> Clifford Wolf wrote:
> > Hi,
> >
> > On Sat, Apr 07, 2018 at 07:44:10PM -0500, Jacob Bachmeyer wrote:
> >
> >> I have previously proposed encodings (message-id

> >> <59FA5380...@gmail.com <javascript:>>
> <URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/59FA5380.4000504%40gmail.com
> <https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/59FA5380.4000504%40gmail.com>>
>
> >> and message-id <59FBA412...@gmail.com <javascript:>>
> <URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/59FBA412.1010307%40gmail.com

Yes and no. GREV's "k" parameter is different from shift amount, but
shares the same domain [0:log2(XLEN)].

> SLO (R-type)
> {opcode, funct3, funct7} = {$OP, $SLL, 7'b0010000}
> SRO (R-type)
> {opcode, funct3, funct7} = {$OP, $SRL, 7'b0010000}
> SLOI (I-type)
> {opcode, funct3, imm12} = {$OP-IMM, $SLL, {7'b0010000, shamt}}
> (RV32)
> {opcode, funct3, imm12} = {$OP-IMM, $SLL, {6'b001000, shamt}}
> (RV64)
> {opcode, funct3, imm12} = {$OP-IMM, $SLL, {5'b00100, shamt}}
> (RV128)
> SROI (I-type)
> {opcode, funct3, imm12} = {$OP-IMM, $SRL, {7'b0010000, shamt}}
> (RV32)
> {opcode, funct3, imm12} = {$OP-IMM, $SRL, {6'b001000, shamt}}
> (RV64)
> {opcode, funct3, imm12} = {$OP-IMM, $SRL, {5'b00100, shamt}}
> (RV128)
> ROL (R-type) (rotates fit as "arithmetic shift ones")
> {opcode, funct3, funct7} = {$OP, $SLL, 7'b0110000}
> ROR (R-type)
> {opcode, funct3, funct7} = {$OP, $SRL, 7'b0110000}
>
>
>
> ROR is just ROL with a negative value for rs2 (since rs2 is taken mod
> log2(XLEN)) so I think only one of ROL and ROR it is needed.

For hardware opcodes, we may as well have both, since making one of them
a pseudo-instruction requires a two-instruction sequence that would
clobber a temporary.

> ROLI (I-type) (hardware opcode reserved for symmetry)
> {opcode, funct3, imm12} = {$OP-IMM, $SLL, {7'b0110000, shamt}}
> (RV32)
> {opcode, funct3, imm12} = {$OP-IMM, $SLL, {6'b011000, shamt}}
> (RV64)
> {opcode, funct3, imm12} = {$OP-IMM, $SLL, {5'b01100, shamt}}
> (RV128)
> RORI (I-type)
> {opcode, funct3, imm12} = {$OP-IMM, $SRL, {7'b0110000, shamt}}
> (RV32)
> {opcode, funct3, imm12} = {$OP-IMM, $SRL, {6'b011000, shamt}}
> (RV64)
> {opcode, funct3, imm12} = {$OP-IMM, $SRL, {5'b01100, shamt}}
> (RV128)
>
>
>
> RORI is just ROLI with a negative immediate so I think only one of ROL
> and ROR it is needed.

The draft proposes making RORI the hardware op and ROLI an assembler
pseudo-op. I agree that only one of these is needed, and propose
reserving the ROLI hardware opcode simply for symmetry with the shifts,
even though the assembler will not produce that opcode and
implementations are not expected to implement that opcode. (The
reservation can be dropped in a later draft if appropriate.)

>
> Slightly-less confident, but opcode and funct3 near-certain:
> ANDC (R-type)
> {opcode, funct3, funct7} = {$OP, $AND, 7'b0100000}
>
> Placed here by considering funct3 $SLT and $SLTU as "special
> function"
> codes:
>
>
> SLT and SLTU seem the right opcodes to try to reserve for instructions
> "close" to comparison. PULP for example, has min[u] and max[u]. or for
> bit manipulation, one could think of a variant of slt and sltu that
> returns -1 (i.e. all bits 1) instead of 1:
>
> negslt rd rs1 rs2 --> slt rd rs1 rs2; sub rd zero rd
> negsltu rd rs1 rs2 --> sltu rd rs1 rs2; sub rd zero rd
> negsli rd rs1 imm --> slt rd rs1 imm; sub rd zero rd
> negslui rd rs1imm --> sltui rd rs1 imm; sub rd zero rd

Another option, instead of using $SLT and $SLTU as "special function"
codes, is to take the same approach that RVM uses and define a new
funct7 value and reuse funct3 to distinguish operations.

I think that I would prefer Hamming distance, unless there is some use
for BCNT of which I am not aware.

Agreed, but regularity of the encoding is also important. 1/1024 of
OP-IMM (for ROLI) may be a suitable price to pay for maintaining that
regularity.

I had earlier suggested that implementations could subset GREV, and
encoding BSWAP as a GREV pseudo-instruction would be compatible with
such an arrangement.

-- Jacob

Jacob Bachmeyer

unread,

Apr 10, 2018, 12:31:50 AM4/10/18

to Paul A. Clayton, RISC-V ISA Dev

The spec is unclear, but I expect that SUB performs rs1 - rs2 -> rd.
ANDC performs rs1 & ~rs2 -> rd. The small catch with reusing the same
logic element for SUB and ANDC is that SUB needs 2's-complement, while
ANDC needs 1's-complement.

-- Jacob

Jacob Bachmeyer

unread,

Apr 10, 2018, 12:45:22 AM4/10/18

to Michael Clark, Clifford Wolf, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Samuel Falvo II

Michael Clark wrote:
> On 9/04/2018, at 5:03 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>

>> [...]

>> The *W variants on RV64 are encoded almost identically, but with the opcode field changed to $OP-32 or $OP-IMM-32 as appropriate. The same pattern provides the *D variants for RV128.
>>
>> Note that *W and *D variants of some of these instructions could be assembler pseudo-instructions and macro-op fused at runtime.
>>

>> [...]

>>
>> Adding Michael to the CC list on this thread; he has been one of our software experts on this and perhaps he will be interested. He knows more about compilers than I do. To Michael: we are carrying on bit manipulation development using <URL:https://github.com/cliffordwolf/xbitmanip> after the BitManip WG was shut down.
>>
>
> I’ll have a look when I get a chance.
>
> One point that I would like to make from an analysis perspective, is that we should include width variants where appropriate. It would be unfortunate to constrain the analysis to XLEN width operations only, and indeed it would not be a complete quantitive analysis if the cost in instructions (and implementations) was not compared for XLEN width-only vs having both 32-bit and 64-bit width variants on RV64. One has to compare the dynamic retired instruction count with and without 32-bit variants on 64-bit, to make an assessment of the value of encoding W variants as instructions vs pseudos. There is also the macro-op fusion argument, but note that this places a burden on simple in-order RV64 systems that don’t implement macro-op fusion. These are perhaps the systems that can benefit the most from fewer dynamic retired instructions, assuming from quantitive analysis, that it is seen that there is benefit from the 32-bit variant.
>

The width variants are included in my proposed encoding, although I
specified them by citing a rule instead of doubling or tripling what was
already a lengthy table. Similarly, for software testing, I agree that
all of them should be tentatively defined as "hardware opcodes" to ease
analysis at this stage. We can always convert them to
pseudo-instructions after we have data about their usage. (I advocate
that *W and *D variants of all bit-manipulation opcodes *should* exist,
but many of them are one-or-two instruction sequences and may be best
overall as assembler pseudo-instructions.)

> Now, regarding quantitive analysis. It would be disingenuous to use SPECint to measure benefits for a small embedded “router”. One must have application profiles that one can measure the cost and benefits. i.e. a network and crypto profile. Note many small systems implement software network and crypto algorithms and many of the algorithms have been optimised for software implementations. e.g. salsa20 chacha20-poly1305. While chacha20-poly1305 doesn’t require byte swapping on little-endian, there are however many software versions of algorithms in plain C or optimized assembly that use BSWAP32 (GREVIW 24) and ROR32 (RORIW, ROLIW) on 64-bit systems. I suggest we use SuperCop and “openssl speed” and nginx or apache TLS as part of the small network and embedded router profile.
>

No objection, but note that GREV/GREVI will not alter the high half of a
32-bit value unless directed to swap the high and low halves of the
64-bit word. Further analysis is needed to determine how often the
sign-extension needs to be preserved; if the value is only used in
bitwise operations and then written to memory as a 32-bit word,
sign-extension need not be preserved.

> For example, a 64-bit rotate is not very useful for SHA-256 (as one example) on RV64. It essentially makes rotate useless except where it is equal to XLEN. SHA-512 uses 64-bit rotate so it would benefit, but SHA-256 would not. If one spends some time going through the large set of commonly used crypto ciphers, digest algorithms and what not, one would find that a 64-bit system will frequently invoke rotate and bswap primitives on 32-bit quantities as many crypto algorithms are designed to use 32-bit “big-endian” values so that they work on heterogenous 32-bit and 64-bit systems, where previously big-endian with 32-bit values was the standard for recent generation crypto algorithms.
>

This is also an interesting application for
packed-SIMD-on-integer-registers.

> It’s only very recent crypto algorithms that have started recognising that “little-endian” has won. e.g. Curve25519 and Ed25519 use little-endian encodings for curve points in contrast to standard elliptic curve algorithms which use “big-endian” curve points.
>
> It would be nice to get the full suite implemented in one of the compilers (e.g. GCC), so that it is possible to do analysis before and after pruning, versus eagerly pruning without any quantitive analysis. If we have width variant encodings, when I could perhaps help with binutils and gcc patches when I get time (but probably not this month).
>

The width variant encodings, for any encoding using OP and OP-IMM, are
produced by a simple rule: replace OP with OP-32 or OP-64 and OP-IMM
with OP-IMM-32 or OP-IMM-64.

-- Jacob

Jacob Bachmeyer

unread,

Apr 10, 2018, 12:53:02 AM4/10/18

to jcb6...@gmail.com, Rogier Brussee, RISC-V ISA Dev, clif...@clifford.at, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

Jacob Bachmeyer wrote:
> Rogier Brussee wrote:
>> Op maandag 9 april 2018 07:03:28 UTC+2 schreef Jacob Bachmeyer:

>> [...]

>> GREV (R-type) (fitting as "left arithmetic shift")
>> {opcode, funct3, funct7} = {$OP, $SLL, 7'b0100000}
>> GREVI (I-type)
>> {opcode, funct3, imm12} = {$OP-IMM, $SLL, {7'b0100000, k}}
>> (RV32)
>> {opcode, funct3, imm12} = {$OP-IMM, $SLL, {6'b010000, k}} (RV64)
>> {opcode, funct3, imm12} = {$OP-IMM, $SLL, {5'b01000, k}} (RV128)
>>
>>
>>
>> k is just a shift amount (right?), i.e. GREVI basically shares the
>> encoding of the shifts?
>
> Yes and no. GREV's "k" parameter is different from shift amount, but
> shares the same domain [0:log2(XLEN)].

Correction: GREV's "k" parameter has the domain [0:XLEN). (What I get
for writing emails when I should go to bed... Good Night, all.)

-- Jacob

Rogier Brussee

unread,

Apr 10, 2018, 3:54:39 AM4/10/18

to RISC-V ISA Dev, clif...@clifford.at, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jcb6...@gmail.com, jhause...@gmail.com

Op maandag 9 april 2018 22:23:36 UTC+2 schreef Rogier Brussee:

Ouch. Forget that, its nonsense. The correct formula is

count_trailing_zero(x) = pop_count (-1 + (x & (-x)))

Rogier Brussee

unread,

Apr 10, 2018, 4:49:26 AM4/10/18

to RISC-V ISA Dev, rogier....@gmail.com, clif...@clifford.at, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com, jcb6...@gmail.com

Op dinsdag 10 april 2018 06:28:56 UTC+2 schreef Jacob Bachmeyer:

FWIW:

ROL rd rs1 rs2 --> SUB rd zero rs2; ROR rd rs1 rd

does not need a temporary. It may be my ignorance, but the R-type form of rotate does not seems critical, and if you drop ROLI, dropping ROL makes the encoding more regular.

Shrug.

That has some merit, but I like to think of B as just some additional more specialised I instructions and I think it should just get as little in the way as possible for other extensions that want to define variants of existing instructions. Having B instructions close to "and", "or" and "xor" as well as sll and srl feels right and is what I proposed for hamm/bcnt and clz. That feeling may well be just that though: an irrational feeling.

I thought bcnt might be more regular with clz (where I can see an extra mask to be useful) and I thought it naturally gives ctz but that was nonsense. However, FWIW, I think HAMM is more elegant myself. Something with a name tends to have uses.

One could define BEXTI and BDEPI like BEXT and BDEP but with rs2 replaced with a _zero_ extended 12 bit immediate to grab and insert in the low 12 bits (or effectively the low 11 bits if you insist on sign extension of immediates).

Right.

-- Jacob

Luke Kenneth Casson Leighton

unread,

Apr 11, 2018, 6:10:48 PM4/11/18

to Jacob Bachmeyer, Rogier Brussee, RISC-V ISA Dev, Clifford Wolf, Samuel Falvo II, Michael Clark, John Hauser

On Tue, Apr 10, 2018 at 5:28 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

>> ROR is just ROL with a negative value for rs2 (since rs2 is taken mod
>> log2(XLEN)) so I think only one of ROL and ROR it is needed.
>
>
> For hardware opcodes, we may as well have both, since making one of them a
> pseudo-instruction requires a two-instruction sequence that would clobber a
> temporary.

And, whilst it is strictly speaking true that ROR = - ROL (and there
is the intention to have an imm variant?) You require a signed
immediate to do so, and in doing so you need one more bit to cover
both cases (one of which now needs 2's complement decoding).

In the registerised version you may not have the negative value in
order to do the opposite rotate and thus would need to... oh that's
what you're talking about Jacob :) yes, 50% of the time you would
need 2 instructions not one.

l.

Jacob Bachmeyer

unread,

Apr 11, 2018, 6:54:10 PM4/11/18

to Rogier Brussee, RISC-V ISA Dev, clif...@clifford.at, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

Rogier Brussee wrote:
> Op dinsdag 10 april 2018 06:28:56 UTC+2 schreef Jacob Bachmeyer:
>
> Rogier Brussee wrote:
> >
> >
> > Op maandag 9 april 2018 07:03:28 UTC+2 schreef Jacob Bachmeyer:
> >
> > Clifford Wolf wrote:
> > > Hi,
> > >
> > > On Sat, Apr 07, 2018 at 07:44:10PM -0500, Jacob Bachmeyer
> wrote:

> [...]

I stand corrected. It still requires a 2-instruction sequence, while
the conversion ROLI->RORI in the assembler merely requires adjusting the
immediate.

> It may be my ignorance, but the R-type form of rotate does not seems
> critical, and if you drop ROLI, dropping ROL makes the encoding more
> regular.

The argument works both ways: dropping ROL eliminates left rotates
entirely, but encoding regularity still suggests placing left rotate
using the $SLL funct3 code where right rotate uses the $SRL funct3 code.

> Shrug.

Agreed.

> [...]

> > Placed here by considering funct3 $SLT and $SLTU as "special
> > function"
> > codes:
> >
> >
> > SLT and SLTU seem the right opcodes to try to reserve for
> instructions
> > "close" to comparison. PULP for example, has min[u] and max[u].
> or for
> > bit manipulation, one could think of a variant of slt and sltu that
> > returns -1 (i.e. all bits 1) instead of 1:
> >
> > negslt rd rs1 rs2 --> slt rd rs1 rs2; sub rd zero rd
> > negsltu rd rs1 rs2 --> sltu rd rs1 rs2; sub rd zero rd
> > negsli rd rs1 imm --> slt rd rs1 imm; sub rd zero rd
> > negslui rd rs1imm --> sltui rd rs1 imm; sub rd zero rd
>
> Another option, instead of using $SLT and $SLTU as "special function"
> codes, is to take the same approach that RVM uses and define a new
> funct7 value and reuse funct3 to distinguish operations.
>
>
> That has some merit, but I like to think of B as just some additional
> more specialised I instructions and I think it should just get as
> little in the way as possible for other extensions that want to define
> variants of existing instructions. Having B instructions close to
> "and", "or" and "xor" as well as sll and srl feels right and is what
> I proposed for hamm/bcnt and clz. That feeling may well be just that
> though: an irrational feeling.

If the CLZ opcode includes an ANDC (clz(rs1 & ~rs2)), encoding it near
ANDC might be an interesting option, especially if the same bit can
select Hamming distance ("HADI"?) when used near XOR.

Hmmm... (<URL:https://en.wikipedia.org/wiki/Hamming_weight>) "Hamming
weight ... is thus equivalent to the Hamming distance from the all-zero
string of the same length." Population count is a special case of
Hamming distance. Does Hamming distance beyond the special case of
population count have enough uses to justify the increased ALU
complexity of using Hamming distance as the basic operation instead of
population count? Hamming distance fits the R-type instruction pattern
better, being 2R1W instead of 1R1W.

> [...]

> > BEXT and BDEP require XLEN-bit control words and therefore
> cannot
> > have
> > an immediate form.
>
>
> One could define BEXTI and BDEPI like BEXT and BDEP but with rs2
> replaced with a _zero_ extended 12 bit immediate to grab and insert in
> the low 12 bits (or effectively the low 11 bits if you insist on sign
> extension of immediates).

Possibly, but there is no particularly good place to encode immediate
forms of these instructions; the only space left in OP-IMM is near the
shift opcodes. (The encodings of GREV/ROL/ROR near shift suggest
similar immediate-form opcodes.) Are B{EXT,DEP}I useful enough to
justify allocating such limited space rather than using a 2-instruction
ADDI/{BEXT,BDEP} pair? If so, B{EXT,DEP} could also be encoded near shift.

I was briefly concerned while reviewing the instruction tables that
RV64I shifts could conflict with RVM. This was in error, since RVM
opcodes exist only in OP, and the expanding "shamt" field exists only in
OP-IMM. This does, however, mean that there is a bit of additional
encoding space near shift for instructions that do not have immediate
forms. Two bits are thus available here (when RV128 is considered) that
cannot be used in OP-IMM near shifts, and are thus available for funct7
codes in OP. The base ISA uses 1/4 (funct7 = 7'b0000000) of this space
and RVM uses another 1/4 (funct7 = 7'b0000001) of this space. The two
remaining funct7 codes are 7'b0000010 and 7'b0000011. This corner of
the encoding is available near all operations in OP, but *not* near the
shifts in OP-IMM. No space is available in OP-IMM near other operations.

-- Jacob

Jacob Bachmeyer

unread,

Apr 11, 2018, 7:00:00 PM4/11/18

to Luke Kenneth Casson Leighton, Rogier Brussee, RISC-V ISA Dev, Clifford Wolf, Samuel Falvo II, Michael Clark, John Hauser

Luke Kenneth Casson Leighton wrote:

> On Tue, Apr 10, 2018 at 5:28 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>
>>> ROR is just ROL with a negative value for rs2 (since rs2 is taken mod
>>> log2(XLEN)) so I think only one of ROL and ROR it is needed.
>>>
>> For hardware opcodes, we may as well have both, since making one of them a
>> pseudo-instruction requires a two-instruction sequence that would clobber a
>> temporary.
>>
>
> And, whilst it is strictly speaking true that ROR = - ROL (and there
> is the intention to have an imm variant?) You require a signed
> immediate to do so, and in doing so you need one more bit to cover
> both cases (one of which now needs 2's complement decoding).
>

Rotates are modulo XLEN, so no sign bit is needed. Modulo 32, 33 is
congruent to 1 is congruent to -31. This is not hard for the assembler
to calculate, but may be an issue for the register-based forms of the
rotate instructions.

> In the registerised version you may not have the negative value in
> order to do the opposite rotate and thus would need to... oh that's
> what you're talking about Jacob :) yes, 50% of the time you would
> need 2 instructions not one.

For the register-based forms of rotate, 2 instructions will always be
needed: one instruction to adjust the rotate amount and one instruction
to perform the rotate, unless the compiler is able to do more complex
program transformations and combine that adjustment with some other step.

-- Jacob

Luke Kenneth Casson Leighton

unread,

Apr 11, 2018, 7:24:00 PM4/11/18

to Jacob Bachmeyer, Rogier Brussee, RISC-V ISA Dev, Clifford Wolf, Samuel Falvo II, Michael Clark, John Hauser

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Wed, Apr 11, 2018 at 11:59 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Luke Kenneth Casson Leighton wrote:

>> And, whilst it is strictly speaking true that ROR = - ROL (and there
>> is the intention to have an imm variant?) You require a signed
>> immediate to do so, and in doing so you need one more bit to cover
>> both cases (one of which now needs 2's complement decoding).
>>
>
>
> Rotates are modulo XLEN, so no sign bit is needed.

oh duh of course :)

Michael Clark

unread,

Apr 11, 2018, 8:51:48 PM4/11/18

to jcb6...@gmail.com, Clifford Wolf, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Samuel Falvo II

> On 10/04/2018, at 4:45 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> The width variant encodings, for any encoding using OP and OP-IMM, are produced by a simple rule: replace OP with OP-32 or OP-64 and OP-IMM with OP-IMM-32 or OP-IMM-64.

I’m happy with that. It certainly makes sense for the analysis stage. Thanks for taking the time to propose an encoding (Jacob, Clifford, et al). I will have a closer look when I get time. I’m certainly very interested in getting the B extension pinned down, especially for the instructions that are already widely-used via compiler intrinsics or compiler lifting.

Michael Clark

unread,

Apr 11, 2018, 8:53:17 PM4/11/18

to Jacob Bachmeyer, Clifford Wolf, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Samuel Falvo II

An example of compiler “lifting” in action here:

https://cx.rv8.io/g/h42x9j

Rogier Brussee

unread,

Apr 12, 2018, 5:18:35 AM4/12/18

to RISC-V ISA Dev, rogier....@gmail.com, clif...@clifford.at, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com, jcb6...@gmail.com

Op donderdag 12 april 2018 00:54:10 UTC+2 schreef Jacob Bachmeyer:

Rogier Brussee wrote:
> Op dinsdag 10 april 2018 06:28:56 UTC+2 schreef Jacob Bachmeyer:
>
> Rogier Brussee wrote:
> >

[snip]

> That has some merit, but I like to think of B as just some additional

> more specialised I instructions and I think it should just get as
> little in the way as possible for other extensions that want to define
> variants of existing instructions. Having B instructions close to
> "and", "or" and "xor" as well as sll and srl feels right and is what
> I proposed for hamm/bcnt and clz. That feeling may well be just that
> though: an irrational feeling.

If the CLZ opcode includes an ANDC (clz(rs1 & ~rs2)), encoding it near
ANDC might be an interesting option, especially if the same bit can
select Hamming distance ("HADI"?) when used near XOR.

HADI suggests an immediate form of HAD. HDST perhaps? Otherwise, good suggestion.

Fitting the R-type instruction was by far the main rationale. In fact I sort of assumed xor (or andc) is sufficiently simple that it is less of a burden than irregular encoding

But hamming distance does naturally count zeros as well as ones and for rs1 != 0 it does ctz in three instructions:

ctz rd rs1 --> addi rd, rs1, -1; hamm rd ,rd, rs1; addi rd, rd, -1

which avoids using a full bitreverse with GREVI which might not always be implemented by hardware.

> [...]
> > BEXT and BDEP require XLEN-bit control words and therefore
> cannot
> > have
> > an immediate form.
>
>
> One could define BEXTI and BDEPI like BEXT and BDEP but with rs2
> replaced with a _zero_ extended 12 bit immediate to grab and insert in
> the low 12 bits (or effectively the low 11 bits if you insist on sign
> extension of immediates).

Possibly, but there is no particularly good place to encode immediate
forms of these instructions; the only space left in OP-IMM is near the
shift opcodes. (The encodings of GREV/ROL/ROR near shift suggest
similar immediate-form opcodes.) Are B{EXT,DEP}I useful enough to
justify allocating such limited space rather than using a 2-instruction
ADDI/{BEXT,BDEP} pair? If so, B{EXT,DEP} could also be encoded near shift.

Don't know, probably not, or at least not for a whole 12 bit immediate.

I was briefly concerned while reviewing the instruction tables that
RV64I shifts could conflict with RVM. This was in error, since RVM
opcodes exist only in OP, and the expanding "shamt" field exists only in
OP-IMM. This does, however, mean that there is a bit of additional
encoding space near shift for instructions that do not have immediate
forms. Two bits are thus available here (when RV128 is considered) that
cannot be used in OP-IMM near shifts, and are thus available for funct7
codes in OP. The base ISA uses 1/4 (funct7 = 7'b0000000) of this space
and RVM uses another 1/4 (funct7 = 7'b0000001) of this space. The two
remaining funct7 codes are 7'b0000010 and 7'b0000011. This corner of
the encoding is available near all operations in OP, but *not* near the
shifts in OP-IMM. No space is available in OP-IMM near other operations.

I see. More reason to be frugal with encodings.

-- Jacob

Clifford Wolf

unread,

Apr 12, 2018, 5:50:17 AM4/12/18

to Jacob Bachmeyer, Rogier Brussee, RISC-V ISA Dev, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

On Mon, Apr 09, 2018 at 11:28:51PM -0500, Jacob Bachmeyer wrote:
> The draft proposes making RORI the hardware op and ROLI an assembler
> pseudo-op. I agree that only one of these is needed, and propose
> reserving the ROLI hardware opcode simply for symmetry with the
> shifts, even though the assembler will not produce that opcode and
> implementations are not expected to implement that opcode. (The
> reservation can be dropped in a later draft if appropriate.)

Instead of reserving ROLI I think it might be a nice spot for putting
singe-argument opcodes (such as pcnt and clz).

Clifford Wolf

unread,

Apr 12, 2018, 5:52:28 AM4/12/18

to Luke Kenneth Casson Leighton, Rogier Brussee, RISC-V ISA Dev, Jacob Bachmeyer, Samuel Falvo II

On Mon, Apr 09, 2018 at 09:12:29PM +0100, Luke Kenneth Casson Leighton wrote:
> i can say more if people are interested, as well as either set up a
> wiki or recommend one that is already up.

I think the most reasonable place would be to use the github wiki from the
xbitmanip repository:

https://github.com/cliffordwolf/xbitmanip/wiki

Andrew Waterman

unread,

Apr 12, 2018, 3:36:00 PM4/12/18

to Clifford Wolf, Jacob Bachmeyer, Rogier Brussee, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Samuel Falvo II, Michael Clark, John Hauser

This matches the pattern in the F/D/proposed-V extensions where bits 24:20 are used as an ancillary opcode field when rs2 is not needed.

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/20180412095015.GA8654%40clifford.at.

Rogier Brussee

unread,

Apr 13, 2018, 9:37:49 AM4/13/18

to RISC-V ISA Dev, jcb6...@gmail.com, rogier....@gmail.com, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

Op donderdag 12 april 2018 11:50:17 UTC+2 schreef clifford:

Using an R-type instruction even if it is slightly artificial still seems the cheaper thing to do because as far as I know for a processor implementing IMAFDB, CLZ and PCNT would be the _only_ 2 register instructions.

Having said that, I agree with the sentiment that reserving opcode for ROLI takes up relatively precious opcode space that can be put to better uses, and that a home for 2 register instructions (just 32 of them!) may be one of them..

In fact what about SROI. I can see a case for SLOI to construct mask immediates. For SROI I am not so sure, however. You would start with an immediate. Then most masks I can think of (so it maybe me) would be constructed just as well with SLLI, SRAI (which compress), RORI, ANDC (so a negative mask works just as well as a positive) and last not least, a SROI pseudo that is just

SROI rd rs1 imm --> XORI rd, rs1, -1; SRLI rd rd imm; XORI rd rd -1

(where if it follows a LI the first XORI can be absorbed in the immediate of the LI, and the last XORI can be absorbed in an ANDC if the next instruction is an AND instruction. Hell, the OP opcode space for R-type instructions has comparatively lots of room, so an ORC instruction encoded like ANDC but with func3 = OR to absorb not's may have better value for money)

Jacob Bachmeyer

unread,

Apr 13, 2018, 7:45:50 PM4/13/18

to Rogier Brussee, RISC-V ISA Dev, clif...@clifford.at, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

Rogier Brussee wrote:
> Op donderdag 12 april 2018 00:54:10 UTC+2 schreef Jacob Bachmeyer:
>
> Rogier Brussee wrote:
>

> [...]

>
>
> > That has some merit, but I like to think of B as just some
> additional
>
> > more specialised I instructions and I think it should just get as
> > little in the way as possible for other extensions that want to
> define
> > variants of existing instructions. Having B instructions close to
> > "and", "or" and "xor" as well as sll and srl feels right and is
> what
> > I proposed for hamm/bcnt and clz. That feeling may well be just
> that
> > though: an irrational feeling.
>
> If the CLZ opcode includes an ANDC (clz(rs1 & ~rs2)), encoding it
> near
> ANDC might be an interesting option, especially if the same bit can
> select Hamming distance ("HADI"?) when used near XOR.
>
>
> HADI suggests an immediate form of HAD. HDST perhaps? Otherwise, good
> suggestion.

HDST suggests some kind of STORE. HDIS perhaps?

-- Jacob

Jacob Bachmeyer

unread,

Apr 13, 2018, 9:25:29 PM4/13/18

to Rogier Brussee, RISC-V ISA Dev, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

Rogier Brussee wrote:
> Op donderdag 12 april 2018 11:50:17 UTC+2 schreef clifford:
>
> On Mon, Apr 09, 2018 at 11:28:51PM -0500, Jacob Bachmeyer wrote:
> > The draft proposes making RORI the hardware op and ROLI an
> assembler
> > pseudo-op. I agree that only one of these is needed, and propose
> > reserving the ROLI hardware opcode simply for symmetry with the
> > shifts, even though the assembler will not produce that opcode and
> > implementations are not expected to implement that opcode. (The
> > reservation can be dropped in a later draft if appropriate.)
>
> Instead of reserving ROLI I think it might be a nice spot for putting
> singe-argument opcodes (such as pcnt and clz).
>
>
>
> Using an R-type instruction even if it is slightly artificial still
> seems the cheaper thing to do because as far as I know for a
> processor implementing IMAFDB, CLZ and PCNT would be the _only_ 2
> register instructions.
>
> Having said that, I agree with the sentiment that reserving opcode for
> ROLI takes up relatively precious opcode space that can be put to
> better uses, and that a home for 2 register instructions (just 32 of
> them!) may be one of them..

The problem with this approach is that ROLI would be in OP-IMM; R-type
instructions go in OP, not OP-IMM.

> In fact what about SROI. I can see a case for SLOI to construct mask
> immediates. For SROI I am not so sure, however.

SROI is similar, when there is a need to extract high-order bits. This
is not much of an issue on RV32, but becomes a bigger problem on RV64
and RV128, where LI can be much longer than 2 instructions.

There is also the issue of encoding regularity, which is reduced by
omitting instructions like this.

> You would start with an immediate. Then most masks I can think of (so
> it maybe me) would be constructed just as well with SLLI, SRAI (which
> compress), RORI, ANDC (so a negative mask works just as well as a
> positive) and last not least, a SROI pseudo that is just
>
> SROI rd rs1 imm --> XORI rd, rs1, -1; SRLI rd rd imm; XORI rd rd -1
>
> (where if it follows a LI the first XORI can be absorbed in the
> immediate of the LI, and the last XORI can be absorbed in an ANDC if
> the next instruction is an AND instruction. Hell, the OP opcode space
> for R-type instructions has comparatively lots of room, so an ORC
> instruction encoded like ANDC but with func3 = OR to absorb not's
> may have better value for money)

That would be an argument against the shift-ones instructions, since the
same works for SLOI/SLLI. I believe that the shift-ones instructions,
particularly with an encoding using bit 29 to select shift-ones are
simple enough to be worthwhile and that the entire set of
SLO/SRO/SLOI/SROI together are simpler to implement than any subset.

-- Jacob

Clifford Wolf

unread,

Apr 14, 2018, 6:50:42 AM4/14/18

to Rogier Brussee, RISC-V ISA Dev, jcb6...@gmail.com, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

Hi,

On Fri, Apr 13, 2018 at 06:37:49AM -0700, Rogier Brussee wrote:
> Using an R-type instruction even if it is slightly artificial still seems
> the cheaper thing to do because as far as I know for a processor
> implementing IMAFDB, CLZ and PCNT would be the _only_ 2 register
> instructions.

For an R-type instruction the ALU would usually not see the value encoded
in bits 24:20 of the instructions word, because it would be decoded as
register address and thus the value forwarded to the ALU would be the value
stored in that address.

For an I-type instruction however this field is decoded as part of the
immediate and forwarded to the ALU.

Thus using an I-type instruction to encode unary instructions is more
efficient because it does not require special handling of those bits in the
instruction decoder.

regards,
- clifford

PS: The current XBitmanip draft spec contains an additional chapter
titled "Alternatives to bext and bdep" that proposes the two additional
unary instructions ZIP and UNZIP:
https://raw.githubusercontent.com/cliffordwolf/xbitmanip/master/xbitmanip-draft.pdf

Rogier Brussee

unread,

Apr 14, 2018, 5:47:19 PM4/14/18

to RISC-V ISA Dev, rogier....@gmail.com, clif...@clifford.at, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com, jcb6...@gmail.com

Op zaterdag 14 april 2018 01:45:50 UTC+2 schreef Jacob Bachmeyer:

HDIS or HDIST works for me.

-- Jacob

Rogier Brussee

unread,

Apr 14, 2018, 6:00:55 PM4/14/18

to RISC-V ISA Dev, rogier....@gmail.com, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com, jcb6...@gmail.com

Op zaterdag 14 april 2018 03:25:29 UTC+2 schreef Jacob Bachmeyer:

Not wanting to make an elephant out of a midget here, but I still don't see why you couldn't just do

li rd -1; slli rd rd -8

(Which compress) to create a mask for the top byte of a 64 or 128 bit integer.

There is also the issue of encoding regularity, which is reduced by
omitting instructions like this.

> You would start with an immediate. Then most masks I can think of (so
> it maybe me) would be constructed just as well with SLLI, SRAI (which
> compress), RORI, ANDC (so a negative mask works just as well as a
> positive) and last not least, a SROI pseudo that is just
>
> SROI rd rs1 imm --> XORI rd, rs1, -1; SRLI rd rd imm; XORI rd rd -1
>
> (where if it follows a LI the first XORI can be absorbed in the
> immediate of the LI, and the last XORI can be absorbed in an ANDC if
> the next instruction is an AND instruction. Hell, the OP opcode space
> for R-type instructions has comparatively lots of room, so an ORC
> instruction encoded like ANDC but with func3 = OR to absorb not's
> may have better value for money)

That would be an argument against the shift-ones instructions, since the
same works for SLOI/SLLI.

Yes especially as I see there is a (destructive C.not) operation.

I believe that the shift-ones instructions,
particularly with an encoding using bit 29 to select shift-ones are
simple enough to be worthwhile and that the entire set of
SLO/SRO/SLOI/SROI together are simpler to implement than any subset.

Ok. I believe you. Just striving for minimality.

-- Jacob

Rogier Brussee

unread,

Apr 14, 2018, 6:08:31 PM4/14/18

to RISC-V ISA Dev, rogier....@gmail.com, jcb6...@gmail.com, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

Op zaterdag 14 april 2018 12:50:42 UTC+2 schreef clifford:

[

[snip]

PS: The current XBitmanip draft spec contains an additional chapter
titled "Alternatives to bext and bdep" that proposes the two additional
unary instructions ZIP and UNZIP:
https://raw.githubusercontent.com/cliffordwolf/xbitmanip/master/xbitmanip-draft.pdf

Observation : all the pseudo's for GREVI have either k = 2^N or k = -2^N. In particular the crucial pseudo' bswap have k = -8 and brev = -1.

Do GREVI's for k = - 2^N also have a simpler structure ?

Jacob Bachmeyer

unread,

Apr 14, 2018, 8:07:32 PM4/14/18

to Rogier Brussee, RISC-V ISA Dev, clif...@clifford.at, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

And a few hours later I thought of another option: HADS ("HAmming
DiStance"). This one has the extra advantage of being pronounceable as
a single syllable, while not actually being a "real word".

-- Jacob

Jacob Bachmeyer

unread,

Apr 14, 2018, 8:26:42 PM4/14/18

to Rogier Brussee, RISC-V ISA Dev, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

Other uses of SROI include in operations like finding the number of
trailing zeroes in the high part of a word: SROI/CTZ. These are
probably rare, but shift-ones-right is neither useless nor covered by
another operation as ROLI is. (While ROLI can be implemented as a
1-instruction RORI sequence with an adjusted immediate, no similar gain
can be had with ROL: its pseudo instruction sequence expands to two
instructions, one to adjust the amount and then ROR to do the rotate.)

> [...]

>
>
> I believe that the shift-ones instructions,
> particularly with an encoding using bit 29 to select shift-ones are
> simple enough to be worthwhile and that the entire set of
> SLO/SRO/SLOI/SROI together are simpler to implement than any subset.
>
>
> Ok. I believe you. Just striving for minimality.

In this case, I think that more instructions (the complete shift-ones
set) paradoxically make for minimal hardware. :-)

-- Jacob

Jacob Bachmeyer

unread,

Apr 14, 2018, 8:47:22 PM4/14/18

to Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

Clifford Wolf wrote:
> Hi,
>
> On Fri, Apr 13, 2018 at 06:37:49AM -0700, Rogier Brussee wrote:
>
>> Using an R-type instruction even if it is slightly artificial still seems
>> the cheaper thing to do because as far as I know for a processor
>> implementing IMAFDB, CLZ and PCNT would be the _only_ 2 register
>> instructions.
>>
>
> For an R-type instruction the ALU would usually not see the value encoded
> in bits 24:20 of the instructions word, because it would be decoded as
> register address and thus the value forwarded to the ALU would be the value
> stored in that address.
>
> For an I-type instruction however this field is decoded as part of the
> immediate and forwarded to the ALU.
>
> Thus using an I-type instruction to encode unary instructions is more
> efficient because it does not require special handling of those bits in the
> instruction decoder.
>

Instead they are presented to the ALU for second-stage decoding... this
is starting to make sense to me.

> regards,
> - clifford
>
> PS: The current XBitmanip draft spec contains an additional chapter
> titled "Alternatives to bext and bdep" that proposes the two additional
> unary instructions ZIP and UNZIP:
> https://raw.githubusercontent.com/cliffordwolf/xbitmanip/master/xbitmanip-draft.pdf
>

While ZIP and UNZIP seem good and as fixed permutations are cheap in
hardware, as of commit 703dedf, GREVM has broken psuedocode that
references rs1 twice in its argument list and has undefined behavior (b
is initialized to 2*b, which would be a garbage value in C, and then
used). Does GREVM use two registers and an immediate? There is no room
to encode that in 32-bit space.

That said, GREVM uses the lower XLEN/2 bits of its second register;
could the upper XLEN/2 bits be used for N? Or is the N parameter in the
domain [0, log2(XLEN)], which would reduce it to 3 bits?

Hmm... GREVM must be encoded in OP, since it needs two input registers
and a result. It could be encoded near GREV by additionally setting bit
28 and using bits [27:25] to store N. Another option is to split GREVM
between the $SLL and $SRL funct3 codes, using bits {14, 26, 25} to store
N and bit 27 to select GREVM.

Either way, GREVM is expensive to encode, requiring 8 R-type opcode slots.

-- Jacob

Clifford Wolf

unread,

Apr 15, 2018, 1:02:45 PM4/15/18

to Jacob Bachmeyer, Rogier Brussee, RISC-V ISA Dev, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

Hi,

On Sat, Apr 14, 2018 at 07:47:20PM -0500, Jacob Bachmeyer wrote:
> While ZIP and UNZIP seem good and as fixed permutations are cheap in
> hardware, as of commit 703dedf, GREVM has broken psuedocode that
> references rs1 twice in its argument list and has undefined behavior
> (b is initialized to 2*b, which would be a garbage value in C, and
> then used).

The 2*b thing was already fixed in the .tex but I did not update the .pdf.
I've now fixed the rs1 issue together with the missing return in swapbits
and updated the pdf.

> Does GREVM use two registers and an immediate? There is no room to
> encode that in 32-bit space.

That's why GREVM is not one instruction, it is 5 instructions (6 on RV64),
as described in the draft spec.

> That said, GREVM uses the lower XLEN/2 bits of its second register;
> could the upper XLEN/2 bits be used for N? Or is the N parameter in
> the domain [0, log2(XLEN)], which would reduce it to 3 bits?

Yes, the N partameter is < log2(XLEN).

> Hmm... GREVM must be encoded in OP, since it needs two input
> registers and a result. It could be encoded near GREV by
> additionally setting bit 28 and using bits [27:25] to store N.
> Another option is to split GREVM between the $SLL and $SRL funct3
> codes, using bits {14, 26, 25} to store N and bit 27 to select
> GREVM.

Discussing conrcrete encodings is a distraction at this point. It's a
complete waste of everyones time.

The only thing that is relevant at this stage is how much encoding space
the instruction needs. The conrete encoding can be discussed once the exact
instruction mix has been decided.

GREVM adds about 40% to the required encoding space of xbitmanip. (ZIP
and UNZIP is negligible.)

It is still a fraction of the encoding space of a single full-sized
I-type instruction (less than a quarter on RV32).

See also my post to the xbitmanip list:
https://groups.google.com/forum/#!topic/riscv-xbitmanip/0Hjr2f80ibQ

> Either way, GREVM is expensive to encode, requiring 8 R-type opcode slots.

No. It requires 5 R-type opcodes on RV32 and 6 on RV64. (Not counting *W
instructions on RV64.)

regards,
- clifford

clifford

unread,

Apr 15, 2018, 2:09:39 PM4/15/18

to RISC-V ISA Dev

Hi,

just two little notes re the OP:

On Saturday, March 12, 2016 at 8:28:38 AM UTC+1, Michael Clark wrote:

* bsrm - Shift Right and Mask (w,d,q) C expression: (src >> start) & ((1 << masklen)-1)

That's just a left shift of xlen-start-masklen followed by a right shift or xlen-masklen. Compressed encodings exist for both operations, so the sequence fits in 32 bits. An implementation can chose to macro-op fuse this sequence into a single micro instruction. No need to create a dedicated instruction for something like this.

(Further advantage of upshift-downshift for extracting a bitfield: You can sign extend by using an arithmetic right shift.)

RISC-V Instruction Decoding

I thought I'd just leave this here: The following function decodes a B-type immediate using the "draft-draft" opcodes grevm3 and zip.

decode_b:

li t0, 0x01ff0000

grevm3 a0, a0, t0

zip a0, a0

li t0, 0xeaa80055

bext a0, a0, t0

c.slli a0, 20

c.srai a0, 19

ret

(You can also use bext twice on the two subsets of immediate bits that are in-order and then assemble the pieces, but afaics that requires at least one more instruction depending on how one counts the li instructions, and bext is possibly a slow multi-cycle op depending on the implementation, so using it only once is also a big plus.)

regards,

- clifford

Rogier Brussee

unread,

Apr 15, 2018, 4:32:46 PM4/15/18

to RISC-V ISA Dev, rogier....@gmail.com, clif...@clifford.at, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com, jcb6...@gmail.com

Op zondag 15 april 2018 02:07:32 UTC+2 schreef Jacob Bachmeyer:

When properly pronounced HDIS seems a Hell of a name ;-)

-- Jacob

Jacob Bachmeyer

unread,

Apr 15, 2018, 10:59:29 PM4/15/18

to Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

Clifford Wolf wrote:
> On Sat, Apr 14, 2018 at 07:47:20PM -0500, Jacob Bachmeyer wrote:

> [...]

>> Hmm... GREVM must be encoded in OP, since it needs two input
>> registers and a result. It could be encoded near GREV by
>> additionally setting bit 28 and using bits [27:25] to store N.
>> Another option is to split GREVM between the $SLL and $SRL funct3
>> codes, using bits {14, 26, 25} to store N and bit 27 to select
>> GREVM.
>>
>
> Discussing conrcrete encodings is a distraction at this point. It's a
> complete waste of everyones time.
>
> The only thing that is relevant at this stage is how much encoding space
> the instruction needs. The conrete encoding can be discussed once the exact
> instruction mix has been decided.
>
> GREVM adds about 40% to the required encoding space of xbitmanip. (ZIP
> and UNZIP is negligible.)
>
> It is still a fraction of the encoding space of a single full-sized
> I-type instruction (less than a quarter on RV32).
>
> See also my post to the xbitmanip list:
> https://groups.google.com/forum/#!topic/riscv-xbitmanip/0Hjr2f80ibQ
>
>
>> Either way, GREVM is expensive to encode, requiring 8 R-type opcode slots.
>>
>
> No. It requires 5 R-type opcodes on RV32 and 6 on RV64. (Not counting *W
> instructions on RV64.)
>

And 7 slots on RV128, since N is less than log2(XLEN). This would allow
N to be encoded as N+1 in three bits, leaving 3'b000 for other
(existing) opcodes. The *W and *D variants are effectively free as long
as the base variant is encoded in OP/OP-IMM.

-- Jacob

Clifford Wolf

unread,

Apr 16, 2018, 4:40:41 AM4/16/18

to Jacob Bachmeyer, Rogier Brussee, RISC-V ISA Dev, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

On Sun, Apr 15, 2018 at 09:59:24PM -0500, Jacob Bachmeyer wrote:
> And 7 slots on RV128, since N is less than log2(XLEN). This would
> allow N to be encoded as N+1 in three bits, leaving 3'b000 for other
> (existing) opcodes.

I'd use 000-110 for GREVM and 111 for another opcode. Would make sense to
put GREV in that last slot, but as I said, spending much time on this
discussion is a distraction at the current point.

Dan Hopper

unread,

Apr 16, 2018, 2:31:15 PM4/16/18

to clifford, RISC-V ISA Dev

Hi Clifford,

I had a separate note last week started, thanking you and the other folks for picking up work on the BitManip spec, but got side-tracked before finishing it. Thanks for doing so!

One of the things I noticed while reviewing the new draft - clz no longer includes the result + immediate adder delay at the end, and that's a good thing to have eliminated.

Adding 2B instr. encodings for common cases is also wise.

Adding 32-bit operand size instruction variants also is positive. Who's to say what a given implementation's max logic depth will be, if whether or not some of these new ops like popcnt and so on might be two-cycle 64-bit ops but one-cycle 32-bit ops? Best not to assume that everyone's going to be fine running 64-bit ALU ops for 32-bit data. (And 32-bit operations save power vs 64-bit ops.)

Regarding the below comment on macro-op fusion - I don't think we should assume that all RISC-V implementations that care about performance have designed instruction fusion logic into their instr. decoders. Isn't it better to make sure that the ops (that we care about the performance of) map to a single instruction where possible, so that they don't take a perf hit on fusion-less machines?

(I realize there's a sentiment from the original RISC-V architects to the effect that all performant RISC-V implementation need to implement instruction fusion. But I strongly disagree. If other competing ISAs don't require fusion logic for similar operations (or the extra I-stream bandwidth requirements it encourages), then shouldn't RISC-V try to be competition here as well?)

Thanks!

Dan Hopper

--

You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/9a6b5ceb-871c-4e3e-9e0c-050f48f95e05%40groups.riscv.org.

Luke Kenneth Casson Leighton

unread,

Apr 16, 2018, 2:46:38 PM4/16/18

to Dan Hopper, clifford, RISC-V ISA Dev

On Mon, Apr 16, 2018 at 7:31 PM, Dan Hopper <dho...@tesla.com> wrote:

> Adding 32-bit operand size instruction variants also is positive. Who's to
> say what a given implementation's max logic depth will be, if whether or not
> some of these new ops like popcnt and so on might be two-cycle 64-bit ops
> but one-cycle 32-bit ops? Best not to assume that everyone's going to be
> fine running 64-bit ALU ops for 32-bit data. (And 32-bit operations save
> power vs 64-bit ops.)

would also be great to be inclusive of 16-bit in order to cover SIMD
and vector-parallelism scenarios. does not need special 16-bit
opcodes: bitwidth can be indirectly stored in a CSR or a prefix
instruction "following instruction is to be SIMD or vector and is to
have its operand size over-ridden from standard RV32 or RV64 meaning
to 16-bit".

l.

Clifford Wolf

unread,

Apr 17, 2018, 6:11:06 AM4/17/18

to Clifford Wolf, Jacob Bachmeyer, Rogier Brussee, RISC-V ISA Dev, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

Hi,

Forget everything about my GREVM[0-6] proposal. I have a *much* better and

cleaner solution. See "generalized bit permutation instructions 'shuffle' and 'unshuffle'"

in the updated draft spec:

https://raw.githubusercontent.com/cliffordwolf/xbitmanip/master/xbitmanip-draft.pdf

They look complicated, but they are very cheap to implement (for the most part they

just reuse the butterfly network that's already there for GREV), and they only require

two additional R-type opcodes instead of the ~7 from the previous proposal.

There is also a lot of free encoding space in the command word. So any other permutation

(or really any other parameterized unary function) that might be interesting enough to add

HW support for, but not sufficiently interesting to reserve an extra instruction or make it mandatory

for all implementations, could be implemented as optional feature within shuffle/unshuffle.

regards,

- clifford

Clifford Wolf

unread,

Apr 17, 2018, 7:42:58 AM4/17/18

to Dan Hopper, RISC-V ISA Dev

Hi,

On Mon, Apr 16, 2018 at 8:31 PM, Dan Hopper <dho...@tesla.com> wrote:

I had a separate note last week started, thanking you and the other folks for picking up work on the BitManip spec, but got side-tracked before finishing it. Thanks for doing so!

Thanks.

Regarding the below comment on macro-op fusion - I don't think we should assume that all RISC-V implementations that care about performance have designed instruction fusion logic into their instr. decoders. Isn't it better to make sure that the ops (that we care about the performance of) map to a single instruction where possible, so that they don't take a perf hit on fusion-less machines?

(I realize there's a sentiment from the original RISC-V architects to the effect that all performant RISC-V implementation need to implement instruction fusion. But I strongly disagree. If other competing ISAs don't require fusion logic for similar operations (or the extra I-stream bandwidth requirements it encourages), then shouldn't RISC-V try to be competition here as well?)

The operation in question was the following:

On Saturday, March 12, 2016 at 8:28:38 AM UTC+1, Michael Clark wrote:

* bsrm - Shift Right and Mask (w,d,q) C expression: (src >> start) & ((1 << masklen)-1)

And the macro-op fusable sequence that I proposed for it was this:

c.slli a0, (32-START-LEN)

c.srli a0, (32-LEN)

If we want to support something like "bsrm" all the way to RV128 then the instruction would have a 2x7 = 14 bit immediate.

That instruction would be half of a major opcode (!), or over 32x times the size of the entire xbitmanip extension as

currently proposed.

Also, some implementations might not actually want to spend the extra area for a second barrel shifter for this

operation. With macro-op-fusion it is up to the individual implementation if they want to support this operation in a

single micro-op or in two micro-ops.

While I agree that generally speaking macro-op-fusion is an advanced technique that only a small number of cores

will support, I'd argue that macro-op-fusion of two compressed 16-bit opcodes into one 32-bit opcode is a special

case that is relatively easy to support.

Re macro-op-fusion in competing architectures: x86 *does* perform macro-op-fusion, just not for this function.

The x86 BMI1 "bextr" instruction (the register variant) is a 5 byte opcode. (It would be possible to implement a similar

functionality as command within the "shuffle" instruction, see current xbitmanip-draft.pdf. But to be honest, I don't think

that would be worth it.)

The x86 TBM "bextr" instruction (that is the one with immediates) is a 7 byte instruction. Trying the squeeze the same

functionality in a 4 byte RISC-V instruction, just because another already existing 4 byte encoding might be a bit harder

to decode, seems like a big waste of RISC-V instruction encoding space to me.

I hope you can agree that this are all good arguments against a dedicated "bsrm" instruction.

Regards,

- clifford

Dan Hopper

unread,

Apr 17, 2018, 9:42:52 AM4/17/18

to Clifford Wolf, RISC-V ISA Dev

Hi Clifford,

Ah, yes I do agree in this bsrm case. Thanks for the detailed explanation. Also wrt the 2 * 2B instruction fusion case not being difficult. Many of the pre-existing fusion cases, however, don't fit into this convenient scenario, thus my general concern. (Apparently misplaced for the bsrm case, however.)

I'm not 100% on the same page wrt the x86 fusion examples, as those are just ordinary variable-length x86 instructions, handled by any compliant x86 decoder, where valid instruction lengths are 1-15 bytes. True, it's not reasonable to expect the same functionality to be packed into a shorter instruction length, your main point.

(In a separate discussion, I might argue for supporting additional RISC-V instruction lengths of 6B and 8B as being preferable to implementing instruction fusion. The former is mostly limited to the decoder logic, the latter also ripples down to retire/exceptions, since you have to be able to split up the fused op's retires in certain exceptional cases.)

Thanks,

Dan

Andrew Waterman

unread,

Apr 17, 2018, 12:28:27 PM4/17/18

to Clifford Wolf, Dan Hopper, RISC-V ISA Dev

On Tue, Apr 17, 2018 at 4:42 AM Clifford Wolf <cliffor...@gmail.com> wrote:

Hi,

On Mon, Apr 16, 2018 at 8:31 PM, Dan Hopper <dho...@tesla.com> wrote:

I had a separate note last week started, thanking you and the other folks for picking up work on the BitManip spec, but got side-tracked before finishing it. Thanks for doing so!

Thanks.

Regarding the below comment on macro-op fusion - I don't think we should assume that all RISC-V implementations that care about performance have designed instruction fusion logic into their instr. decoders. Isn't it better to make sure that the ops (that we care about the performance of) map to a single instruction where possible, so that they don't take a perf hit on fusion-less machines?

(I realize there's a sentiment from the original RISC-V architects to the effect that all performant RISC-V implementation need to implement instruction fusion. But I strongly disagree. If other competing ISAs don't require fusion logic for similar operations (or the extra I-stream bandwidth requirements it encourages), then shouldn't RISC-V try to be competition here as well?)

The operation in question was the following:

On Saturday, March 12, 2016 at 8:28:38 AM UTC+1, Michael Clark wrote:
* bsrm - Shift Right and Mask (w,d,q) C expression: (src >> start) & ((1 << masklen)-1)
And the macro-op fusable sequence that I proposed for it was this:

c.slli a0, (32-START-LEN)
c.srli a0, (32-LEN)

If we want to support something like "bsrm" all the way to RV128 then the instruction would have a 2x7 = 14 bit immediate.

That instruction would be half of a major opcode (!), or over 32x times the size of the entire xbitmanip extension as
currently proposed.

Also, some implementations might not actually want to spend the extra area for a second barrel shifter for this
operation. With macro-op-fusion it is up to the individual implementation if they want to support this operation in a
single micro-op or in two micro-ops.

While I agree that generally speaking macro-op-fusion is an advanced technique that only a small number of cores
will support, I'd argue that macro-op-fusion of two compressed 16-bit opcodes into one 32-bit opcode is a special
case that is relatively easy to support.

Re macro-op-fusion in competing architectures: x86 *does* perform macro-op-fusion, just not for this function.

Modern ARM implementations do, too. Nothing special about RISC-V here.

The x86 BMI1 "bextr" instruction (the register variant) is a 5 byte opcode. (It would be possible to implement a similar
functionality as command within the "shuffle" instruction, see current xbitmanip-draft.pdf. But to be honest, I don't think
that would be worth it.)

The x86 TBM "bextr" instruction (that is the one with immediates) is a 7 byte instruction. Trying the squeeze the same
functionality in a 4 byte RISC-V instruction, just because another already existing 4 byte encoding might be a bit harder
to decode, seems like a big waste of RISC-V instruction encoding space to me.

I hope you can agree that this are all good arguments against a dedicated "bsrm" instruction.

Regards,
- clifford

--

You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAG5EYeWQT%2BRQ6G98ukEaqtgxdMEFGx_RMmqgELpC3NF1MWpcqw%40mail.gmail.com.

Jacob Bachmeyer

unread,

Apr 17, 2018, 10:44:09 PM4/17/18

to Clifford Wolf, Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, lk...@lkcl.net, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

I like the concept, but I suspect that it will have the same problem as
the field-extract and field-deposit operations I proposed (message-id
<58968596...@gmail.com>
<URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/58968596.7060903%40gmail.com>)
earlier -- the use of a command word to control the operation. I think
that message is where I first proposed what became the "shift-ones"
instructions in the current draft, since they eliminate the most
complicated part of preparing a mask for field-extract/field-deposit.
(The assembler and C example might not match in that message, if I
remember correctly.)

While we should consider various possibilities (in particular operations
that may prove easy for compilers to use), I think that more complex bit
shuffles than GREV may be less helpful for the initial RVB.

-- Jacob

Allen Baum

unread,

Apr 20, 2018, 11:02:26 AM4/20/18

to Jacob Bachmeyer, Clifford Wolf, Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Samuel Falvo II, Michael Clark, John Hauser

I really need to look at how butterfly is actually implemented in HW.

This is basically a GREV with a specific power-of-2 swap distance, but with individual bit enables.

A question: does

zip + butterfly(mask, 0)

mean zip followed by butterfly, { butterfly(zip()) }

or butterfly followed by zip, { zip(butterfly()) }

or something different ?

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/5AD6B0F4.9070105%40gmail.com.

Clifford Wolf

unread,

Apr 20, 2018, 11:46:42 AM4/20/18

to Allen Baum, Jacob Bachmeyer, Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Samuel Falvo II, Michael Clark, John Hauser

Hey Allen,

On Fri, Apr 20, 2018 at 5:02 PM, Allen Baum <allen...@esperantotech.com> wrote:

I really need to look at how butterfly is actually implemented in HW.
This is basically a GREV with a specific power-of-2 swap distance, but with individual bit enables.

Yes.

A question: does

zip + butterfly(mask, 0)

mean zip followed by butterfly, { butterfly(zip()) }
or butterfly followed by zip, { zip(butterfly()) }
or something different ?

It means a zip followed by a butterfly(mask, 0).

This is equivalent to a butterfly(x, mask, LOG2_XLEN-1) followed by a zip.

The best way to understand those operations imo is to think of them as operations on the bit

indices. E.g. a grev(x, 1) operation (i.e. butterfly(x, -1, 0)) swaps even and odd bits, or in

other words, flips the LSB bit (bit number 0) of the bit indices. (I.e. the grev operation

performs a XOR of the bit indices with the bit pattern in rs2.)

Then you can think of the interaction between zip/unzip and butterfly the following way:

For a butterfly(x, mask, N) the pairs of bits to swap are bits whose indices are only different

in position N. I.e. a butterfly(x, mask, N) swaps bits by flipping bit N in the position indices.

The zip and unzip operations perform a shift rotate on the bit indices. Zip rotates left and unzip

rotates right.

So assuming RV32, "butterfly(x, mask, 4) + zip" will toggle bit 4 in the bit indices and then rotate

left by one bit (so that bit 4 is now bit 0 because bit indices in 32 bit words are 5 bit wide), which

is of course the same as first rotating left one bit and then toggling the least significant bit, i.e.

"zip + butterfly(x, mask, 0)".

Btw, flip_alt_check() in cproofs/shuffle.cc proves a similar equivalence for the flip operation

("unzip + butterfly(x, mask, 4)" is equivalent to "butterfly(x, mask, 0) + unzip"). See cproofs/shuffle.sh

for the cbmc command line for proving this. (It is possible that this does not work with the

relatively old cbmc versions that are packaged with many distributions, but it works as expected

with cbmc git head. :)

kind regards,

- Clifford

PS: I'm right now in the middle of an editing sprint in the xbitmanip document. I'll bump the version

number after that and send a short announcement to isa-dev and the xbitmanip google group.

Stay tuned! :)

Allen Baum

unread,

Apr 20, 2018, 7:52:25 PM4/20/18

to Clifford Wolf, Jacob Bachmeyer, Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Samuel Falvo II, Michael Clark, John Hauser

As I had mentioned in the birt manipulation group meetings, you can implement butterfly with all1s mask using exisitng shift logic.

Once you add the mask, though, all bets are off; it turns out to be nasty, as you need control each individual swap (as opposed to a shifter, which has one control for each log2 shift). The amount of logic is trivisl; the amount of wiring is horrible. I'll think about how to duce that to something more reasonable ( shifters already hav a lot of horizontal wires because you need to end lsbs to the msbs and vice versa).

Separately: it was mentioned at Coolchips that there is no big-endian support in RiscV.

The butterfly/grev ops can alleviate that - but it might be easier if the mask is inverted (so 0 mean swap, and 1 means dont).

They control constants in the normal cases can more easily be generated, as all the upper bits ("including mask) are zeroes.

This is less than a perfect solution; a better solution is another B-extension (Big_endian).

Clifford Wolf

unread,

Apr 21, 2018, 5:54:39 AM4/21/18

to Allen Baum, Jacob Bachmeyer, Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Samuel Falvo II, Michael Clark, John Hauser

Hi,

On Sat, Apr 21, 2018 at 1:52 AM, Allen Baum <allen...@esperantotech.com> wrote:

As I had mentioned in the birt manipulation group meetings, you can implement butterfly with all1s mask using exisitng shift logic.

yes. rotate shift and grev is almost the same circuit. (Same number of gates, just slightly different wiring. The stages of rotate shift add 16, 8, 4, 2, 1 to the bit indices, and the stages of GREV xor the bit indices with those values. The potential for routing congestions should be about the same.)

Once you add the mask, though, all bets are off; it turns out to be nasty, as you need control each individual swap (as opposed to a shifter, which has one control for each log2 shift). The amount of logic is trivisl; the amount of wiring is horrible.

Yes, you end up with log2(xlen) stage enable lines and xlen/2 bit control lines. If you AND them near the muxes then wiring should not be a big deal. Routing the xlen/2 control lines next to the data lines down the stages only adds +50% more wires in that direction, and there is always already a data wire running that mux->mux path that the control line must go, so you need a bit extra space for that wire in that channel, but the routing channel must already be there anyways.

So I can see that the wiring is a bit worse compared to a GREV without the control lines (or ROT), but I do not see how if a regular ROT or GREV is okay, and a GREV with that control lines suddenly is "horrible". But I'm eager to learn if there is something crucial that I am missing.

I'll think about how to duce that to something more reasonable ( shifters already hav a lot of horizontal wires because you need to end lsbs to the msbs and vice versa).

Separately: it was mentioned at Coolchips that there is no big-endian support in RiscV.
The butterfly/grev ops can alleviate that - but it might be easier if the mask is inverted (so 0 mean swap, and 1 means dont).
They control constants in the normal cases can more easily be generated, as all the upper bits ("including mask) are zeroes.
This is less than a perfect solution; a better solution is another B-extension (Big_endian).

Simple GREV should be sufficient for endianess conversion. (The "bswap" pseudo-instruction in the current draft spec.)

Can you explain to me in what way it would be better to use 0 for swap in the mask? From a circuit perspective it should not make a difference (swap and no-swap are symmetrical), and from an ISA perspective 1 for swap is better imo because that allows us to use rs2=x0 for zip (i.e. perfect outer shuffle), with all swaps taken that would be a perfect inner shuffle instead, which I think is the less obvious choice as default.

If you want to perform a single butterfly stage with "all mask bits set" then you'd simply use GREVI, not SHUFFLE.

regards,

- Clifford

PS: Plain text e-mail is a horrible medium for discussing circuit topology and ASIC layout. :) So please forgive me if I misunderstood any of your points. I'd be happy to send (links to) cell phone images of crappy sketches of what I mean and would welcome the same.

Allen Baum

unread,

Apr 21, 2018, 9:18:56 AM4/21/18

to Clifford Wolf, Jacob Bachmeyer, Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Samuel Falvo II, Michael Clark, John Hauser

I totally agree that email is a bad medium to discuss this- I was going to say the same thing. I would love to see your sketches.

I was hoping to almost completely reuse the shifter for grev - for all the most useful cases ( which I thought were the k=2^n cases)- but the really useful ones are k=-2^n cases. So, you have to add logic and double the horizontal wires. (I always draw datapaths from top to bottom). That’s before the mask bits are implemented. I need to think more about that- I think I see what you mean, I need to draw it. Probably after I get home next week..

The reason I thought that the mask should be inverted was because I thought the common case ( all swaps enabled, e.g. for big-median support) it would be much more difficult to generate a constant with many 1s, but many 0s is easy.

Finally- it isn’t clear to me how to implement a load signed big endian halfword, even with with grev.

-Allen

Clifford Wolf

unread,

Apr 21, 2018, 2:47:50 PM4/21/18

to Allen Baum, Jacob Bachmeyer, Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Samuel Falvo II, Michael Clark, John Hauser

Hi,

On Sat, Apr 21, 2018 at 3:18 PM, Allen Baum <allen...@esperantotech.com> wrote:

I totally agree that email is a bad medium to discuss this- I was going to say the same thing. I would love to see your sketches.

I'll doodle something tomorrow. (I'm done thinking about bit permutations for today.. :)

I was hoping to almost completely reuse the shifter for grev - for all the most useful cases ( which I thought were the k=2^n cases)- but the really useful ones are k=-2^n cases.

So for that you'd need a "backup" fully functioning GREV unit for the other cases, right? This one might be an option (32 bit):

https://github.com/cliffordwolf/xbitmanip/blob/master/verilog/tinygrev.v

This takes 5 cycles to compute one grev operation. It's the smallest GREV unit I've made so far. Mapped to a simple cmos library:

DFF 43

NAND 216

NOR 8

NOT 9

For size comparison: The trivial 32-bit rotate shift unit at the end of that Verilog file:

NAND 480

NOT 5

The reason I thought that the mask should be inverted was because I thought the common case ( all swaps enabled, e.g. for big-median support) it would be much more difficult to generate a constant with many 1s, but many 0s is easy.

Ah, okay. But since one would be using GREVI for that, not SHUFFLE, you would not even need a mask.

Finally- it isn’t clear to me how to implement a load signed big endian halfword, even with with grev.

Sign extension is not a permutation, so the sign extend itself cannot be done with grev. You'd need something like the following sequence (RV32):

lhu a0, offset(a0)
grevi a0, a0, 24
srai a0, a0, 16

The SRAI has a compressed encoding, but the LHU does not, so that sequence as it is would be 10 bytes. But you can replace LHU with LW (the extra bits are thrown away by the SRAI anyways), which would allow you to squeeze this sequence in 8 bytes.

regards,

- clifford

Allen Baum

unread,

Apr 21, 2018, 10:32:58 PM4/21/18

to Clifford Wolf, Jacob Bachmeyer, Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Samuel Falvo II, Michael Clark, John Hauser

Ah- I couldn’t figure out how you did it in so few gates until I looked at the code and saw it was for RV32 instead of RV64. In any case, the amount of logic is trivial; it’s the wires that cost area ( across the datapath).

I thought you were proposing shuffle/unshuffle as a replacement for grev, not in addition to it. I think I see how you were implementing it by passing the mask down at each stage and shuffling it along with the data, and using it to enable the mux select (so mask would be forced to all1 for grev when mux selects can have any value, and be allowed to be set to something else when mux selects are limited to a single bit)

The cost there is mask routing follows the data routing; you have to shuffle it to match the data layout at the relevant level in the tree, and that’s at least 64 wires across the datapath, and possibly double that to match the data shuffling ( the same as the shifter)

Integrating shifter and grev together would cost less than each separately, but is a lot of wires, not much logic.

I’m ignoring the zip/unzip stages, which are probably each the same number of wires and almost no logic.

-Allen

Luke Kenneth Casson Leighton

unread,

Apr 21, 2018, 11:36:13 PM4/21/18

to Clifford Wolf, Allen Baum, Jacob Bachmeyer, Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, Samuel Falvo II, Michael Clark, John Hauser

On Sat, Apr 21, 2018 at 7:47 PM, Clifford Wolf <cliffor...@gmail.com> wrote:
> So for that you'd need a "backup" fully functioning GREV unit for the other
> cases, right? This one might be an option (32 bit):
>
> https://github.com/cliffordwolf/xbitmanip/blob/master/verilog/tinygrev.v
>
>
> This takes 5 cycles to compute one grev operation. It's the smallest GREV
> unit I've made so far.

ha, very cool. I was just going to ask if you were evaluating
embedded scenarios as well as high-performance ones.

Clifford, I take it that you're looking at this "as a whole",
optimising the *overall* area (gate and with Allen's help wire area),
in particular reducing the *number* of instructions rather than
focussing on say whether one *specific* instruction takes up a
"disproportionate" amount of space [and arbitrarily eliminating it
from consideration exclusively on that basis alone]?

If so would it be nice to make sure that the same strategy works just
as well in the embedded (resource-constrained) environment as in the
performance one, even if the number of cycles for any given operation
to complete is much longer?

just going through V0.35-draft, it looks really good, exceptionally
comprehensive. question: 2.8 why return 0 in rd rather than throw an
exception? exception allows software-emulation in a trap, what did i
miss?

spelling mistake "opecodes" p14. patch attached (git am apply),
there's quite a few others.

p22 can i suggest including a reference implementation (c-code) of
64-bit BEXT/BDEP for emulation in a software-trap so that implementors
can make an informed assessment (and we can take a look as well)? in
particular i'd be interested to see how a 64-bit BEXT/BDEP would be
implemented using a combination of 32-bit BEXT/BDEP plus other 64-bit
xBitManip primitives.

with thanks and gratitude to everyone contributing: Allen, Clifford,
Rogier, Jacob and more.

l.

0001-various-spelling-corrections.patch

Clifford Wolf

unread,

Apr 22, 2018, 7:29:56 AM4/22/18

to Luke Kenneth Casson Leighton, Allen Baum, Jacob Bachmeyer, Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, Samuel Falvo II, Michael Clark, John Hauser

Hi,

On Sun, Apr 22, 2018 at 5:35 AM, Luke Kenneth Casson Leighton <lk...@lkcl.net> wrote:

just going through V0.35-draft, it looks really good, exceptionally
comprehensive. question: 2.8 why return 0 in rd rather than throw an
exception? exception allows software-emulation in a trap, what did i
miss?

I actually was planning to add this as implementation-defined behavior. (I

did not want mandate it because it adds a data-dependent trap.) But I think

I'll replace shuffle anyways so it does not matter now.. (see other mail)

spelling mistake "opecodes" p14. patch attached (git am apply),
there's quite a few others.

applied.

p22 can i suggest including a reference implementation (c-code) of
64-bit BEXT/BDEP for emulation in a software-trap so that implementors
can make an informed assessment (and we can take a look as well)?

Section 2.7 ("Bit Extract/Deposit (bext, bdep)") contains c-code reference

implementations for bext and bdep. In what way are they isufficient?

Various Verilog reference implementations for different design corners can

be found at https://github.com/cliffordwolf/bextdep.

in particular i'd be interested to see how a 64-bit BEXT/BDEP would be
implemented using a combination of 32-bit BEXT/BDEP plus other 64-bit
xBitManip primitives.

See C code here:

https://github.com/cliffordwolf/xbitmanip/blob/master/cproofs/bextdep.cc

regards,

- clifford

Clifford Wolf

unread,

Apr 22, 2018, 7:48:48 AM4/22/18

to Allen Baum, Jacob Bachmeyer, Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Samuel Falvo II, Michael Clark, John Hauser

Hi,

On Sun, Apr 22, 2018 at 4:32 AM, Allen Baum <allen...@esperantotech.com> wrote:

Ah- I couldn’t figure out how you did it in so few gates until I looked at the code and saw it was for RV32 instead of RV64. In any case, the amount of logic is trivial; it’s the wires that cost area (across the datapath).

yes it is for RV32. but because this implementation only uses one permutation instead of 5 (or 6 for RV64) you also need far fewer crossed wires.

I thought you were proposing shuffle/unshuffle as a replacement for grev, not in addition to it. I think I see how you were implementing it by passing the mask down at each stage and shuffling it along with the data, and using it to enable the mux select (so mask would be forced to all1 for grev when mux selects can have any value, and be allowed to be set to something else when mux selects are limited to a single bit)

There seem to be two misunderstandings.

(1) shuffle was never intended as replacement for grev. Of the two grev is clearly the more important operation. Shuffle (bfly+zip/unzip) is intended to help in situations where a core does not provide a single-cycle bext/bdep to reduce the number of calls to bext/bdep. (Like SLL, ADD, and SUB can help reduce the number of necessary calls to MUL.)

(2) The "tinygrev" core I posted only implements grev, not shuffle. So there are no per-bit-pair masks.

However, I'm now experimenting with dedicated zip instructions and a single "butterfly-exchange" instruction as a simpler alternative to "shuffle". I think this would be a much cleaner spec and would also be simpler to implement.

See Chapter 3, Alternative to shuffle instruction, in the current draft spec:

https://raw.githubusercontent.com/cliffordwolf/xbitmanip/master/xbitmanip-draft.pdf

Let me know if you agree that this proposal is less frightening than "shuffle".

regards,

- clifford

Luke Kenneth Casson Leighton

unread,

Apr 22, 2018, 7:57:31 AM4/22/18

to Clifford Wolf, Allen Baum, Jacob Bachmeyer, Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, Samuel Falvo II, Michael Clark, John Hauser

On Sun, Apr 22, 2018 at 12:29 PM, Clifford Wolf
<cliffor...@gmail.com> wrote:

> On Sun, Apr 22, 2018 at 5:35 AM, Luke Kenneth Casson Leighton
> <lk...@lkcl.net> wrote:
>>
>> just going through V0.35-draft, it looks really good, exceptionally
>> comprehensive. question: 2.8 why return 0 in rd rather than throw an
>> exception? exception allows software-emulation in a trap, what did i
>> miss?
>
>
> I actually was planning to add this as implementation-defined behavior. (I
> did not want mandate it because it adds a data-dependent trap.) But I think
> I'll replace shuffle anyways so it does not matter now.. (see other mail)

ok :)

>> p22 can i suggest including a reference implementation (c-code) of
>> 64-bit BEXT/BDEP for emulation in a software-trap so that implementors
>> can make an informed assessment (and we can take a look as well)?
>
>
> Section 2.7 ("Bit Extract/Deposit (bext, bdep)") contains c-code reference
> implementations for bext and bdep. In what way are they isufficient?

Oh, no way insufficient, sorry i wasn't clear: I should have said
that I was interested to see a reference implementation which, turns
out to be exactly bext32_bext64 and bdep32_bdep64.

I don't know what to suggest (specifications are best as
self-contained documents?), perhaps an assembly-code version of
bext32_bext64 would be more compact and clearer (as part of the
document rather than an external reference which could change)?

l.

Jacob Bachmeyer

unread,

Apr 22, 2018, 6:59:17 PM4/22/18

to Clifford Wolf, Allen Baum, Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Samuel Falvo II, Michael Clark, John Hauser

Clifford Wolf wrote:
> On Sat, Apr 21, 2018 at 1:52 AM, Allen Baum

> <allen...@esperantotech.com <mailto:allen...@esperantotech.com>>
> wrote:
> [...]

>
> Once you add the mask, though, all bets are off; it turns out to
> be nasty, as you need control each individual swap (as opposed to
> a shifter, which has one control for each log2 shift). The amount
> of logic is trivisl; the amount of wiring is horrible.
>
>
> Yes, you end up with log2(xlen) stage enable lines and xlen/2 bit
> control lines. If you AND them near the muxes then wiring should not
> be a big deal. Routing the xlen/2 control lines next to the data lines
> down the stages only adds +50% more wires in that direction, and there
> is always already a data wire running that mux->mux path that the
> control line must go, so you need a bit extra space for that wire in
> that channel, but the routing channel must already be there anyways.
>
> So I can see that the wiring is a bit worse compared to a GREV without
> the control lines (or ROT), but I do not see how if a regular ROT or
> GREV is okay, and a GREV with that control lines suddenly is
> "horrible". But I'm eager to learn if there is something crucial that
> I am missing.

The problem is not the implementation at all. The problem is getting
the control signals *into* the operation: there is not room in 32-bit
RISC-V to directly encode such an operation. SHUFFLE and UNSHUFFLE
therefore must use a control word in a register, although arranging the
control word layout such that it can be loaded with LUI on RV32 is
clever. If rs2 matches rd, that LUI/{UN,}SHUFFLE would be a possible
fusion pair.

-- Jacob

Jacob Bachmeyer

unread,

Apr 22, 2018, 7:11:20 PM4/22/18

to Clifford Wolf, Allen Baum, Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Samuel Falvo II, Michael Clark, John Hauser

Clifford Wolf wrote:
> [...]

>
> However, I'm now experimenting with dedicated zip instructions and a
> single "butterfly-exchange" instruction as a simpler alternative to
> "shuffle". I think this would be a much cleaner spec and would also be
> simpler to implement.
>
> See Chapter 3, Alternative to shuffle instruction, in the current
> draft spec:
> https://raw.githubusercontent.com/cliffordwolf/xbitmanip/master/xbitmanip-draft.pdf
>
> Let me know if you agree that this proposal is less frightening than
> "shuffle".

Generalized zip is certainly easier to encode, avoiding the need for an
XLEN-bit control word. I like it, although I believe that the mnemonics
should be GZIP/GZIPI for the R-type and I-type forms, respectively.
Encoding space is available near the immediate shifts; in fact, if there
is no "generalized unzip", the "ROLI" hardware opcode could be used for
GZIPI, although this would preclude an R-type GZIP instruction. Is
generalized zip something that would be useful with run-time-variable mode?

-- Jacob

Clifford Wolf

unread,

Apr 22, 2018, 8:31:31 PM4/22/18

to Jacob Bachmeyer, Clifford Wolf, Allen Baum, Rogier Brussee, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Samuel Falvo II, Michael Clark, John Hauser

Hi,

On Sun, Apr 22, 2018 at 06:11:16PM -0500, Jacob Bachmeyer wrote:
> Generalized zip is certainly easier to encode, avoiding the need for
> an XLEN-bit control word. I like it, although I believe that the
> mnemonics should be GZIP/GZIPI for the R-type and I-type forms,
> respectively. Encoding space is available near the immediate
> shifts; in fact, if there is no "generalized unzip", the "ROLI"
> hardware opcode could be used for GZIPI, although this would
> preclude an R-type GZIP instruction. Is generalized zip something
> that would be useful with run-time-variable mode?

I've just pushed a new version of the document with a bit more information.
https://raw.githubusercontent.com/cliffordwolf/xbitmanip/master/xbitmanip-draft.pdf

It would be an I-type only instruction and the least significant bit of the
mode already controls if it performs its inverse. So "ROLI" or "SALI" would
both work. (One for GREVI the other for ZIP.)

Furthermore, there are a few encodings in the ZIP mode field that are
either equivalent to other entries or no-ops. I've marked them as
"reserved" for now. But they could be used as encoding space for
unary functions like CLZ, CTZ, and PCNT.

There are also a few entries that encode for combinations of other
entries. It is unlikely that they would be used often in real-world
applications and there is an argument to be had that marking them also as
reserved could potentially ease the way for alternative/creative ways of
implementing the ZIP instruction.

regards,
- clifford

Jacob Bachmeyer

unread,

Apr 22, 2018, 11:08:09 PM4/22/18

to Clifford Wolf, Clifford Wolf, Allen Baum, Rogier Brussee, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Samuel Falvo II, Michael Clark, John Hauser

Clifford Wolf wrote:
> On Sun, Apr 22, 2018 at 06:11:16PM -0500, Jacob Bachmeyer wrote:
>
>> Generalized zip is certainly easier to encode, avoiding the need for
>> an XLEN-bit control word. I like it, although I believe that the
>> mnemonics should be GZIP/GZIPI for the R-type and I-type forms,
>> respectively. Encoding space is available near the immediate
>> shifts; in fact, if there is no "generalized unzip", the "ROLI"
>> hardware opcode could be used for GZIPI, although this would
>> preclude an R-type GZIP instruction. Is generalized zip something
>> that would be useful with run-time-variable mode?
>>
>
> I've just pushed a new version of the document with a bit more information.
> https://raw.githubusercontent.com/cliffordwolf/xbitmanip/master/xbitmanip-draft.pdf
>
> It would be an I-type only instruction and the least significant bit of the
> mode already controls if it performs its inverse. So "ROLI" or "SALI" would
> both work. (One for GREVI the other for ZIP.)
>

Since generalized zip will only be GZIPI, I would suggest using the
"ROLI" encoding for GZIPI, and the "SAL"/"SALI" encodings for GREV/GREVI.

> Furthermore, there are a few encodings in the ZIP mode field that are
> either equivalent to other entries or no-ops. I've marked them as
> "reserved" for now. But they could be used as encoding space for
> unary functions like CLZ, CTZ, and PCNT.
>

A possibility, especially for the no-ops.

> There are also a few entries that encode for combinations of other
> entries. It is unlikely that they would be used often in real-world
> applications and there is an argument to be had that marking them also as
> reserved could potentially ease the way for alternative/creative ways of
> implementing the ZIP instruction.
>

Is there a systematic structure to the mode values and their meanings?

-- Jacob

Allen Baum

unread,

Apr 23, 2018, 12:15:54 AM4/23/18

to jcb6...@gmail.com, Clifford Wolf, Clifford Wolf, Rogier Brussee, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Samuel Falvo II, Michael Clark, John Hauser

Resending, as this went off lost accidentally:

Answering some questions quickly, and not in order, then I have a plane to catch:

In a multistage shifter that can shift in both directions, you want to shift by the largest amount first to save logic.This [is/can be] implemented by s 127in-64out right shifter.
A right shift fills the upper 63 bits with either 0s or the sign for logical/arithmetic shifts ( or replicates the source for rotate)
For left shifts, the shift amount is 2s complemented, the lower half is filled with 0, and the source is place in the upper half.

The first stage is 64+32 bits wide, the second is 64+16, third is 64+8, etc.
Alternatives include a separate bitreverse at both input and output or separate shifters, or a rotater with a separate output masking stage.
(wires....)

This implementation also can do a doublewide shift (concatenate 2 64b inputs and shift them)
---------
Better terminology: most significant shift?
---------
I don't have serious preferences about the mask encoding; whatever works best for SW is the best.
i'll defer to your judgement about what is best.
---------
The reason that a fullwidthmask is less expensive is that in the halfwidth case, each mask bit controls 2 muxes (one at each end of the swap). But, depending on the butterfly stage, which 2 bits changes: sometimes if bitN and bitN^1, sometimes bitN^2, etc.
To handle that, you need to essentially permute the mask at each stage, even though its only used in one stage at a time.
Wires...
---------
I am a bit surprised that there are libraries that don't have AOI22s; they are pretty basic.
---------
Implementing zip/unzip for free: Well, not exactly free. The first stage of the shifter (by my reckoning) has XLEN wires running most of the way across the datapath. if you extend them to go all the way (since those channels can't be used for much else easily), then you can tap off exactly the bits you need at the positions you need them for zip/unip. The first mux stage becomes a 3way mux instead of a 2way (or N+1, depending on whatever else you have there).Given the amount of wiring there, that additional logic is not noticeable

Note that this doesn't easily generalize to multiple zip/unzips. Each additional one costs an additional mux input and additional vertical tap
You can't implement it at other stages because only the first stage has XLEN wires across the datapath
So, implementing all 6 zip/unzip turns the first stage into an 8way mux instead of 2way - assuming you can get all 8 taps into a bit pitch (probably yes for a custom layout - might be expensive otherwise)
-----------

-Allen

Clifford Wolf

unread,

Apr 23, 2018, 6:37:33 AM4/23/18

to Jacob Bachmeyer, Clifford Wolf, Allen Baum, Rogier Brussee, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Samuel Falvo II, Michael Clark, John Hauser

Hi,

On Sun, Apr 22, 2018 at 10:08:05PM -0500, Jacob Bachmeyer wrote:
> Since generalized zip will only be GZIPI, I would suggest using the
> "ROLI" encoding for GZIPI, and the "SAL"/"SALI" encodings for
> GREV/GREVI.

That makes sense.

Also, I've now renamed the instruction to "gzip" to avoid confusion with
the "zip" pseudo instruction (mode=XLEN-2).

Since there is no R-type instruction for gzip, and it's more of a
parametric unary function than a function with two arguments, I think
"gzip" is a better choice than "gzipi" for the instruction mnemonic.

>> There are also a few entries that encode for combinations of other
>> entries. It is unlikely that they would be used often in real-world
>> applications and there is an argument to be had that marking them also as
>> reserved could potentially ease the way for alternative/creative ways of
>> implementing the ZIP instruction.
>
> Is there a systematic structure to the mode values and their meanings?

Yes. mode[0] toggles if the instruction performs a zip or unzip. The other
mode bits toggle different stages in a permutation network.

The stages are (for RV32):

mode[1] -> zip.n
mode[2] -> zip2.b
mode[3] -> zip4.h
mode[4] -> zip8

Where
zip.n performs a zip on each nibble

zip2.b performs a zip on 2-bit blocks in each byte

zip4.h performs a zip on 4-bit blocks in each half-word

zip8 performs a zip on 8-bit blocks

(Note that those 4 operations are exactly those zip/unzip operations that
are their own inverse.)

There are two interpretations of mode[0] in this context. Either one can
think of it as inverting the order of the four permutation stages, or,
and that's how hw implementations would likely do it, it enables
additional pre- and post-stages that effectively turn zip.n into zip8,
zip2.b into zip4.h, zip4.h into zip2.b, and zip8 into zip.n.

I will add diagrams for the basic gzip and grev permutations. But for now I
added some C code. See updated version of the spec:

https://raw.githubusercontent.com/cliffordwolf/xbitmanip/master/xbitmanip-draft.pdf

I have now proven with cbmc that this method correctly implements all the
zip/unzip operations. See cproofs/genzip.{cc,sh}.

Therefore my next step will be to remove "shuffle" from the spec and move
gzip to the actual spec chapter of the document.

regards,
- clifford

--
"Backups are for wimps. Real men upload their data to an FTP site and
have everyone else mirror it." -- Linus Torvalds

lk...@lkcl.net

unread,

Apr 23, 2018, 8:19:15 AM4/23/18

to RISC-V ISA Dev, cliffor...@gmail.com, allen...@esperantotech.com, jcb6...@gmail.com, clif...@clifford.at, rogier....@gmail.com, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

clifford (et al) does this look about right as an implementation of bext32_bext64?

bext32_bext64: # emulates bext64 a5, a0, a2

addi a1, a0, 0 # src_lo copy low 32-bits

sraiw a0, 32 # src_hi shift down top 32 bits

addi a3, a2, 0 # mask_lo copy low 32-bits

sraiw a2, 32 # mask_hi shift down top 32 bits

bext a4, a1, a3 # lo bext32

bext a5, a0, a2 # hi bext32

slaiw a5, 32 # shift hi bext32 up

c.or a5, a4 # or two bext32s

Luke Kenneth Casson Leighton

unread,

Apr 23, 2018, 8:27:57 AM4/23/18

to Clifford Wolf, Clifford Wolf, RISC-V ISA Dev

couple more spelling corrections, signed-off this time (happy with the
license). if you know the commands to modify the previous commit i'm
happy to do that.
l.

0001-spelling-corrections.patch

Clifford Wolf

unread,

Apr 23, 2018, 10:50:17 AM4/23/18

to lk...@lkcl.net, RISC-V ISA Dev, cliffor...@gmail.com, allen...@esperantotech.com, jcb6...@gmail.com, rogier....@gmail.com, sam....@gmail.com, michae...@mac.com, jhause...@gmail.com

On Mon, Apr 23, 2018 at 05:19:15AM -0700, lk...@lkcl.net wrote:
> clifford (et al) does this look about right as an implementation of
> bext32_bext64?
>
> bext32_bext64: # emulates bext64 a5, a0, a2
> addi a1, a0, 0 # src_lo copy low 32-bits
> sraiw a0, 32 # src_hi shift down top 32 bits
> addi a3, a2, 0 # mask_lo copy low 32-bits
> sraiw a2, 32 # mask_hi shift down top 32 bits
> bext a4, a1, a3 # lo bext32
> bext a5, a0, a2 # hi bext32
> slaiw a5, 32 # shift hi bext32 up
> c.or a5, a4 # or two bext32s

"sraiw rX, 32" is not a valid instruction.

Luke Kenneth Casson Leighton

unread,

Apr 26, 2018, 7:46:39 AM4/26/18

to Clifford Wolf, RISC-V ISA Dev, clifford, Allen Baum, Jacob Bachmeyer, Rogier Brussee, Samuel Falvo II, Michael Clark, John Hauser

sorry, catching up: i believe i mean sraiw rX, rX, 32 and likewise
slaiw rX, rX, 32 is that better?

thinking about this over the past few days, my main concern with
going into the 64-bit case with effectively SIMD-like (twin split
32-bit) instructions is: they're SIMD-like, and as such my feeling is
that they should really be left *to* be parallelised *by* a given SIMD
(or vectorisation) proposal.

even without suitable "Object-Orientated" style parallelisation
extensions, would it be reasonable to expect VLIW (and possibly some
OoO) microarchitectures to inherently parallelise the above operations
into something approximating bext64?

even if *that's* not the case, would it be reasonable to exclude
bext64 (and bdep64, and any other SIMD-like twin-split 32-bit
operations) on the basis that xBitManip (or, all Extensions) should be
minimised so as to reduce opcode space usage?

and similar questions along similar lines.

l.

[1] TI C670* DSPs have a VLIW-like microarchitecture with up to *14*
instructions executable in parallel - with caveats

Clifford Wolf

unread,

Apr 26, 2018, 8:20:30 AM4/26/18

to Luke Kenneth Casson Leighton, RISC-V ISA Dev, clifford, Allen Baum, Jacob Bachmeyer, Rogier Brussee, Samuel Falvo II, Michael Clark, John Hauser

On Thu, Apr 26, 2018 at 12:46:14PM +0100, Luke Kenneth Casson Leighton wrote:
> On Mon, Apr 23, 2018 at 3:50 PM, Clifford Wolf <clif...@clifford.at> wrote:
> > On Mon, Apr 23, 2018 at 05:19:15AM -0700, lk...@lkcl.net wrote:
> >> clifford (et al) does this look about right as an implementation of
> >> bext32_bext64?
> >>
> >> bext32_bext64: # emulates bext64 a5, a0, a2
> >> addi a1, a0, 0 # src_lo copy low 32-bits
> >> sraiw a0, 32 # src_hi shift down top 32 bits
> >> addi a3, a2, 0 # mask_lo copy low 32-bits
> >> sraiw a2, 32 # mask_hi shift down top 32 bits
> >> bext a4, a1, a3 # lo bext32
> >> bext a5, a0, a2 # hi bext32
> >> slaiw a5, 32 # shift hi bext32 up
> >> c.or a5, a4 # or two bext32s
> >
> > "sraiw rX, 32" is not a valid instruction.
>
> sorry, catching up: i believe i mean sraiw rX, rX, 32 and likewise
> slaiw rX, rX, 32 is that better?

*sigh* Those are not valid instructions either.
The shamt range for the *w variants of shift instructions is 0-31.

Luke Kenneth Casson Leighton

unread,

Apr 26, 2018, 8:35:37 AM4/26/18

to Clifford Wolf, RISC-V ISA Dev, clifford, Allen Baum, Jacob Bachmeyer, Rogier Brussee, Samuel Falvo II, Michael Clark, John Hauser

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Thu, Apr 26, 2018 at 1:20 PM, Clifford Wolf <clif...@clifford.at> wrote:

>> > "sraiw rX, 32" is not a valid instruction.
>>
>> sorry, catching up: i believe i mean sraiw rX, rX, 32 and likewise
>> slaiw rX, rX, 32 is that better?
>
> *sigh* Those are not valid instructions either.
> The shamt range for the *w variants of shift instructions is 0-31.

oh poop :) does anyone know the right ones? it might just be srai and slai.

l.