Review of XBitmanip v 0.9 (June 10)


Rogier Brussee

Sep 2, 2019, 4:55:16 PM9/2/19
to RISC-V ISA Dev

<this is a copy-paste of the attached file, which may be slightly better formatted>


Review of Xbitmanip V0.9 

Rogier Brussee

2019-09-03




Disclaimer: 

Views are strictly my own; I have no particular expertise in this area, but I have been following the discussions for quite a while. Proposals are gratuitous, not based on quantitative analysis other than "you propose this, but it can be done with one or two instructions shaved off using this seemingly simpler instruction". Finally, review is not criticism, YMMV.



General remarks:


Great that the B extension is moving again and making progress. Special thanks to Clifford for pushing this for a long time; it must have been an awful lot of work.


See below for general remarks on the B extension and other bit-related Z extensions.


For many sections it is not quite clear what the W versions (for RV64) or D versions (for RV128) are supposed to do. I will try to indicate how I understand them to work.


It would be useful if a rationale for the use of an instruction were given and quantitative data on its use were available.


General remarks about this review


I try to use uppercase for instructions in the text to make them stand out and  lower case in definitions, but there is no semantic difference. 


Many sections contain C code as a way to define what the instruction is supposed to be doing. I like this because it is usually much easier to follow than words. However, I think there is value in making the definition as conceptually simple as possible even if it is (much) slower, and I will suggest some. Can we move the faster but less intuitive implementations to an appendix? The simplest versions will usually be "bit by bit" algorithms, for which I use the following macros and functions:


* I use  LEN  which can be #defined to XLEN to get the regular version or to  32, 64, for the W and D versions.  One could perhaps use C++ template notation like sext<LEN> () instead.


* In C code definitions involving shifts I prefer to write something like rs1 << (rs2 % LEN) wherever a shift is made, to emphasise that shifts in RV are mod XLEN, 32 (for W) or 64 (for D). This also makes it slightly clearer that it makes sense to shift by negative numbers.

As a result I avoid definitions like  shamt = rs2 & (LEN - 1)  which, of course, is equivalent but conceptually less clear.


* If the order of a loop is not important I use this macro

/* for loop that does not care about order: fake it by running backwards */
#define FOREACH(I, BEGIN, END) for (I = (END) - 1; I >= (BEGIN); --I)


* I use functions  that do sign and zero extensions acting on the full XLEN bit register. Hardware is supposed to have better/cheaper ways to do this than the C code below.


int_xlen_t sext(int_xlen_t rs1, unsigned imm)
{
    return (rs1 << ((-imm) % XLEN)) >> ((-imm) % XLEN);
}

uint_xlen_t zext(uint_xlen_t rs1, unsigned imm)
{
    return (rs1 << ((-imm) % XLEN)) >> ((-imm) % XLEN);
}


* I use single-bit functions acting on the full XLEN bit register to set, get, clear and invert bits (named after some instructions in the proposal).


uint_xlen_t bget1(uint_xlen_t rs1, unsigned imm)
{
    return rs1 & ((uint_xlen_t)1 << (imm % XLEN));
}

uint_xlen_t bset1(uint_xlen_t rs1, unsigned imm)
{
    return rs1 | ((uint_xlen_t)1 << (imm % XLEN));
}

uint_xlen_t bclr1(uint_xlen_t rs1, unsigned imm)
{
    return rs1 & ~((uint_xlen_t)1 << (imm % XLEN));
}

uint_xlen_t binv1(uint_xlen_t rs1, unsigned imm)
{
    return rs1 ^ ((uint_xlen_t)1 << (imm % XLEN));
}

uint_xlen_t sbget1(uint_xlen_t rs1, unsigned imm)
{
    return !!bget1(rs1, imm);
}


=======


Page 1  Chapter 1.


Comment: I would propose a further criterion:


Support compilers implementing the basic data types of standard C-like languages (pointers, _Bool/bool, char, [un]signed char, [unsigned] short, [unsigned] int, [unsigned] long long) and their basic operations without surprises regarding performance and/or code size, in particular for software written suboptimally or not written for RISC-V, e.g. using unsigned or short integers.


Page 4. Section 2.1. Basic Manipulation instructions


Comment: It seems to me that the B extension has moved towards "integer instructions that extend or optimise the basic I instructions". I think this is a good thing, and I would argue that the B extension should be positioned as slightly cheaper than the M extension, useful for basic compiler support, standard applications like bit banging in device drivers, networking, and very basic algorithms like hashing (see also the remark on supporting the C/C++ language). Z extensions can then cater for more specialised purposes.


 

Page 4: section 2.1.1 CLZ/CTZ


Alternative implementations:

uint_xlen_t clz(uint_xlen_t rs1)
{
    uint_xlen_t rd = 0;

    for (int i = LEN - 1; i >= 0 && !bget1(rs1, i); --i)
        ++rd;
    return rd;
}

uint_xlen_t ctz(uint_xlen_t rs1)
{
    uint_xlen_t rd = 0;

    for (int i = 0; i < LEN && !bget1(rs1, i); ++i)
        ++rd;
    return rd;
}



Please add:

For the W versions set LEN = 32.


Please add

CLZ and CTZ are used for finding the first element of a bit map; in particular they are used in the implementation of malloc. CLZ is also used for soft float.



Section 2.1 Basic bit manipulation

Page 5: section 2.1.2 clz/ctz

 

Page 6 section 2.1.2 PCNT


Alternative implementation


uint_xlen_t pcnt(uint_xlen_t rs1)
{
    uint_xlen_t rd = 0;
    int i;

    FOREACH(i, 0, LEN)
        rd += sbget1(rs1, i);
    return rd;
}




Please add: 

For the W version set LEN = 32


Comment

Maybe it is useful to have a parity instruction even if this can be implemented with pcnt and &1, because it should  use less power. 


Proposal:

We have ctz(n) == pcnt(~n & (n - 1)) == pcnt(~(n | (-n))), so a two-register instruction

cntz rd rs1 rs2 : rd = popcount(~(rs1 | (-rs2)))

would allow PCNT, CTZ and CLZ to be implemented as 3 macros with at most two instructions each instead of 3:


pcnt rd rs1 : addi rd rs1 1; cntz rd zero rd

ctz   rd rs1 : cntz rd rs1 rs1

clz   rd rs1 : brev rd rs1; cntz rd rd rd
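To make this concrete, here is a minimal C sketch of the proposed CNTZ semantics (my code, reusing the pcnt() helper defined above; the instruction is not in the draft):

uint_xlen_t cntz(uint_xlen_t rs1, uint_xlen_t rs2)
{
    /* rd = popcount(~(rs1 | (-rs2))) == popcount(~rs1 & (rs2 - 1)) */
    return pcnt(~(rs1 | (-rs2)));
}

For example, cntz(rs1, rs1) equals ctz(rs1), and cntz(0, rs1 + 1) equals pcnt(rs1), matching the macros above.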


 

Page 5 section 2.1.3 Logic with negate (andn, orn, xnor)


Comment: I like that all basic logical operations have a “n” version. 

Is nand and nor not also worth it (the RVV mask instructions seem to have them for example) ? 


Proposal:

I would propose the following additional logic instructions


#and rd rs1 rs2 # rd = rs1 & rs2

andn rd rs1 rs2 # rd = rs1 &~ rs2

nand rd rs1 rs2 # rd = ~(rs1 & rs2) 

orn  rd rs1 rs2   # rd = ~(rs1 &~ rs2) = ~rs1 | rs2

 

and.ne     rd rs1 rs2 # rd = (-!!rs1) &  rs2 == (!!rs1) ? rs2 : 0

andn.ne   rd rs1 rs2 # rd = (-!!rs1) &~ rs2 == (!!rs1) ? ~rs2 : 0

orn.eq      rd rs1 rs2 # rd = ~((-!!rs1) &  rs2) == (-!rs1)  |~ rs2 == (!!rs1) ? ~rs2 : (-1)

or.eq        rd rs1 rs2 # rd = ~((-!!rs1) &~ rs2) == (-!rs1)  |  rs2 == (!!rs1) ? rs2    : (-1) 


#or           rd rs1 rs2 #  rd = rs1 | rs2

nor rd rs1 rs2 # rd = ~(rs1   | rs2)

#xor          rd rs1 rs2 # rd = rs1 ^ rs2

xnor rd rs1 rs2 # rd = ~(rs1 ^ rs2)  == rs1 ^ rs2 ^ (-1) == rs1 ^ ~ rs2


This would make AND a 3-bit family (i.e. a basic instruction with some "option flags" or small parameters, here nothing/~ on rs2, nothing/-!! on rs1, nothing/~ on rd) and OR and XOR a 1-bit family of instructions (nothing/~ on rd) that can be placed with the AND, OR and XOR instructions themselves (i.e. the instructions ANDN, NAND and AND.NE are supposed to have the same funct3 value as AND, being "AND"-like, but differ in a few bits in the funct7 space).


The ORN instruction with the  inversion on rs1 would allow 


C.not rd  —> orn rd rd zero


to be more regular than XOR rd rs -1 or an ORN with a ~ on rs2; but with ANDN, NAND, ORN, NOR and XNOR, a C.NOT instruction may not be such a high-priority instruction anymore.


The following macros  implement the boolean (i.e. 0/1 valued) operators in two instructions


sand   rd rs1 rs2 : and.ne rd rs1 rs2 ; neqz rd rd  # rd = !!((-!!rs1) & rs2) == !!rs1 & !!rs2
snand  rd rs1 rs2 : and.ne rd rs1 rs2 ; eqz  rd rd  # rd = !((-!!rs1) & rs2)  == !rs1 | !rs2
sorn   rd rs1 rs2 : or.eq  rd rs1 rs2 ; neqz rd rd  # rd = !!((-!rs1) | rs2)  == !rs1 | !!rs2
sandn  rd rs1 rs2 : or.eq  rd rs1 rs2 ; eqz  rd rd  # rd = !((-!rs1) | rs2)   == !!rs1 & !rs2
sor    rd rs1 rs2 : or     rd rs1 rs2 ; neqz rd rd  # rd = !!(rs1 | rs2)      == !!rs1 | !!rs2
snor   rd rs1 rs2 : or     rd rs1 rs2 ; eqz  rd rd  # rd = !(rs1 | rs2)       == !rs1 & !rs2

sneq   rd rs1 rs2 : xor    rd rs1 rs2 ; neqz rd rd  # rd = !!(rs1 ^ rs2)      == rs1 != rs2
seq    rd rs1 rs2 : xor    rd rs1 rs2 ; eqz  rd rd  # rd = !(rs1 ^ rs2)       == rs1 == rs2

 

In the vast majority of cases the NEQZ and EQZ  instructions will be taken care of by the branch or the AND/OR.NE/EQ instructions. 


The macros 


mneqz  rd rs1 : andn.ne rd rs1 zero  

meqz    rd rs1 : or.eq      rd rs1 zero  


give a logic value as 0/(-1) instead of  0/1 which is useful in combination with masking.


Like the other logical instructions these instructions preserve but don’t establish the invariant of a word being sign extended. No W versions are proposed.
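As a reference for the semantics, here is a minimal C sketch of the four proposed AND/OR variants with a -!!/-! condition on rs1 (the instructions and names are my proposal, not part of the draft):

uint_xlen_t and_ne (uint_xlen_t rs1, uint_xlen_t rs2) { return (rs1 ? (uint_xlen_t)-1 : 0) &  rs2; }  /* rs1 != 0 ? rs2  : 0  */
uint_xlen_t andn_ne(uint_xlen_t rs1, uint_xlen_t rs2) { return (rs1 ? (uint_xlen_t)-1 : 0) & ~rs2; }  /* rs1 != 0 ? ~rs2 : 0  */
uint_xlen_t orn_eq (uint_xlen_t rs1, uint_xlen_t rs2) { return (rs1 ? 0 : (uint_xlen_t)-1) | ~rs2; }  /* rs1 != 0 ? ~rs2 : -1 */
uint_xlen_t or_eq_ (uint_xlen_t rs1, uint_xlen_t rs2) { return (rs1 ? 0 : (uint_xlen_t)-1) |  rs2; }  /* rs1 != 0 ? rs2  : -1 */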


Page 6  section 2.1.4 pack



uint_xlen_t pack(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return zext(rs1, LEN/2) | (zext(rs2, LEN/2) << LEN/2);
}


Where LEN = 32 for the W version. 


please add

The pack instruction can be used to pack "struct" data that fits in an XLEN-sized register and is passed in registers.



Proposal:

I really like the pack instructions. However, I would propose to have 4 XLEN-independent pack instructions:


pack.bu  rd rs1 rs2  # rd = zext(rs1, 8)  | (zext(rs2, 8)  << 8)
pack.hu  rd rs1 rs2  # rd = zext(rs1, 16) | (zext(rs2, 16) << 16)
pack.wu  rd rs1 rs2  # rd = zext(rs1, 32) | (zext(rs2, 32) << 32)  // (RV64)
pack.du  rd rs1 rs2  # rd = zext(rs1, 64) | (zext(rs2, 64) << 64)  // (RV64, RV128)



Page 7.

Comment.

One should not need to use shfl instructions to pack bytes. See proposal for pack.bu.


The ZEXT pseudo instruction could also use ADDWU. Also this is (as far as I can see) the only mention of RV128. It should perhaps be mentioned everywhere (preferably) or nowhere.



Page 7 section 2.1.5 Min/Max


Proposal:

I would propose to also add SGE and SGEU. As is done in RI5CY/PULP, these would be nicely placed with the SLT and SLTU instructions (i.e. have the same funct3 bits as SLT and SLTU, see the proposal in section 2.1.3), making each a 2-bit family:


//slt  rd rs1 rs2  # rd = rs1 <_s  rs2
sge    rd rs1 rs2  # rd = rs1 >=_s rs2
min    rd rs1 rs2  # rd = rs1 <_s  rs2 ? rs1 : rs2
max    rd rs1 rs2  # rd = rs1 >=_s rs2 ? rs1 : rs2

//sltu rd rs1 rs2  # rd = rs1 <_u  rs2
sgeu   rd rs1 rs2  # rd = rs1 >=_u rs2
minu   rd rs1 rs2  # rd = rs1 <_u  rs2 ? rs1 : rs2
maxu   rd rs1 rs2  # rd = rs1 >=_u rs2 ? rs1 : rs2
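A minimal C sketch of the proposed SGE/SGEU (my proposal, not in the draft), mirroring SLT/SLTU:

uint_xlen_t sge (uint_xlen_t rs1, uint_xlen_t rs2) { return (int_xlen_t)rs1 >= (int_xlen_t)rs2; }  /* signed compare   */
uint_xlen_t sgeu(uint_xlen_t rs1, uint_xlen_t rs2) { return rs1 >= rs2; }                          /* unsigned compare */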



Perhaps this would make it beneficial to define 


//sneqz rd rs1 : sltu   rd  zero rs1

seqz    rd rs1 : sgeu rd zero rs1  # instead of sltiu rd rs1 1

 

Page 8 

Taking the absolute value of an integer is a generally useful operation, and I think it is not necessary to mention SAT solvers.



Page 8. Section 2.1.6 sbset/sbclr/sbinv/sbext



Please add:

These instructions efficiently support bitfields, and avoid costly constant construction and nasty surprises, especially if the bitfields are not in the low bits.


Bikeshed.

I think the s prefix should be reserved for instructions returning booleans (0/1), which means only SBEXT  is properly named. I would suggest BSET1, BCLR1, BINV1, and SBEXT1 or SBGET1 (but see below)



I would argue that a BGET1 instruction


uint_xlen_t bget1(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return rs1 & ((uint_xlen_t)1 << (rs2 % LEN));
}


is more useful and cheaper than an SBGET1 instruction.



Page 9 Section 2.1.7 SLO/SRO


Comment: 

I can see that these instructions are cheap on the ALU. However, I still don't really see much rationale for them, and each uses a precious 7-bit immediate.


The main application seems to be to construct a mask of imm ones, i.e. (1 << imm) - 1, but the sequence

li rd -1 ; srli rd rd ((-imm) % XLEN)

is just as long if rd is one of the popular registers x8-x15 and can be fused if needed.


Page 10 section Bit Permutation Instructions

Page 10 section 2.2.1 rol/ror


Please add: 

For the rotation instructions we consider the bit pattern in a register as being indexed by a number in the integers mod LEN (and do sign extension for LEN < XLEN), and "shift" the bit pattern in rs1 by offsetting the indices, considered as integers mod LEN, by rs2 mod LEN.


The main uses of the rotation instructions are block ciphers, hashing and ??????.


uint_xlen_t ror(uint_xlen_t rs1, uint_xlen_t rs2)
{
    // FOREACH(i, 0, LEN)
    //     rd[i] = rs1[(i + rs2) % LEN]

    uint_xlen_t rd = (rs1 >> (rs2 % LEN)) | (rs1 << ((-rs2) % LEN));
    return sext(rd, LEN);
}

uint_xlen_t rol(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return ror(rs1, -rs2);
}



For the W version, LEN = 32 and the W versions of the shifts are used.



Comment: 

While ROL and ROR are the easiest permutations to explain, they are also the most easily done with just shifts and ORs. The butterfly network is also not really simpler than that for GREV, i.e. slightly expensive. With the SLLNEG/SRLNEG below and, more importantly, the immediate versions, it is just 3 instructions. MIPS doesn't have a ROL and ROR instruction. Are ROL and ROR actually worth it?



Proposal: 

ROL is just ROR with rs2 negated, i.e. "ROL == RORNEG" (or equivalently ROR == ROLNEG). Therefore one of ROL and ROR can be considered overkill, especially since the immediate version will be much more common than the register version. Alternatively, if ROL is considered useful, the corresponding NEG versions of the regular shifts should be even more useful.

Define  SLLNEG, SRANEG and SRLNEG instructions


sllneg  rd rs1 rs2  # rd = rs1 <<    ((-rs2) % XLEN)
srlneg  rd rs1 rs2  # rd = rs1 >>_u  ((-rs2) % XLEN)
sraneg  rd rs1 rs2  # rd = rs1 >>_s  ((-rs2) % XLEN)

sllnegw rd rs1 rs2  # rd = rs1 <<_w  ((-rs2) % 32)
srlnegw rd rs1 rs2  # rd = rs1 >>_wu ((-rs2) % 32)
sranegw rd rs1 rs2  # rd = rs1 >>_ws ((-rs2) % 32)


This shaves off one instruction when doing ROL/ROR or FSL/FSR with shifts and OR, and likewise shaves off two instructions in the bit extraction and insertion pattern for the register version.


The instructions can use the existing negator on rs2, although the bit used to indicate negation in the ADD instruction is unfortunately already taken up by arithmetic vs. logical shift for the right shifts. No immediate version is needed.
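A minimal C sketch of the proposed *NEG shifts (my proposal, not in the draft), using the mod-XLEN convention from the introduction:

uint_xlen_t sllneg(uint_xlen_t rs1, uint_xlen_t rs2) { return rs1 << ((-rs2) % XLEN); }
uint_xlen_t srlneg(uint_xlen_t rs1, uint_xlen_t rs2) { return rs1 >> ((-rs2) % XLEN); }
int_xlen_t  sraneg(int_xlen_t  rs1, uint_xlen_t rs2) { return rs1 >> ((-rs2) % XLEN); }

With these, ror(rs1, rs2) is simply the OR of srl(rs1, rs2) and sllneg(rs1, rs2), i.e. a rotate is two shifts and an OR.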



Page 11 section 2.2.2


Please add:

For the GREV instruction we consider the bit pattern in a register as being indexed by LOG2(LEN) bits (and do sign extension for LEN < XLEN), and "shift" the bit pattern in rs1 by XOR-ing the index bits, considered as LOG2(LEN) bits, with the last LOG2(LEN) bits of rs2 (i.e. rs2 & (LEN - 1) == rs2 % LEN).


Please keep

The GREV instruction provides …… instruction. 


Please change :

It takes a single register…. reversals occur.


Please add.

Byte order swap (rs2 == -8) is essential for networking. Bit order reversal (rs2 == -1)  is needed for the fast Fourier transform and “reverse addressing”.



Please change:

s/This operation iteratively/The grev instruction can be implemented with a butterfly network that iteratively/



Page 12.


Comment:

I really don’t like the C functions of grev as a  definition. The “intuitive to understand” is in fact not so intuitive. . Can we put that in an appendix??? The obviously correct (but dreadfully slow) definition is: 


uint_xlen_t grev(uint_xlen_t rs1, uint_xlen_t rs2)
{
    // for (int i = 0; i < LEN; ++i)
    //     rd[i] = rs1[i ^ (rs2 & (LEN - 1))]

    uint_xlen_t rd = 0;
    int i;

    FOREACH(i, 0, LEN)
        rd |= sbget1(rs1, i ^ (rs2 & (LEN - 1))) << i;

    return sext(rd, LEN);
}


The W version has LEN = 32, the D version has LEN = 64.


Comment. 

I really don’t like the explanation of the pseudo-instructions below the table. What do they actually do? Something like


In other words, the prefix controls the number of bits of the grouping whose order is reversed as a group (bits, bytes, words, ...), whereas the suffix controls the number of bits within which the order is reversed. Thus rev8 swaps bytes within the whole XLEN-bit register, whereas rev8.w swaps bytes within each 32-bit word of an XLEN-bit register.


Comment.

GREV/GREVI is probably the right thing to do. However, I want to point out that the overwhelmingly most common use case would be byte swap (i.e. imm = -8), and that moreover all grev's can be reduced to grev's of a negative power of 2, just as they can be reduced to grev's of a positive power of 2, i.e. the butterfly stages. This is because


        grev(rs1, k) == grev(rs1, k % XLEN),
        grev(rs1, k ^ l) == grev(grev(rs1, k), l),
        grev(rs1, 0) == rs1,
        grev(rs1, (-a) & k) == a ? grev(rs1, k) : rs1   for a = 0, 1.


Finally, every number 0 <= k <= XLEN - 1 can be written as

        k = ((-a_0) & (-1)) ^ ((-a_1) & (-2)) ^ ((-a_2) & (-4)) ^ ... ^ ((-a_{LOG2(XLEN)-1}) & (-XLEN/2))

 

with the single bits a_0, a_1, a_2, ..., a_{LOG2(XLEN)-1} defined uniquely by the identities

        sbget1(k, 0) == a_0,   sbget1(k, i) ^ sbget1(k, i-1) == a_i,   i > 0.
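The decomposition can be checked with a small C sketch (my code) that composes grev stages of negative powers of two selected by the a_i above; since grev amounts simply XOR together, this reproduces grev(rs1, k) for any k:

uint_xlen_t grev_from_stages(uint_xlen_t rs1, uint_xlen_t k)
{
    uint_xlen_t rd = rs1;
    unsigned prev = 0;

    for (unsigned i = 0; ((uint_xlen_t)1 << i) < XLEN; ++i) {
        unsigned bit = (k >> i) & 1;
        unsigned a_i = bit ^ prev;                  /* a_0 = k_0, a_i = k_i ^ k_{i-1} */
        if (a_i)
            rd = grev(rd, -((uint_xlen_t)1 << i));  /* grevi rd, rd, -(1 << i) */
        prev = bit;
    }
    return rd;                                      /* == grev(rs1, k) */
}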


Therefore there is the possibility of providing the grev functionality with instructions corresponding to a single stage of a slightly different butterfly network, but with the freedom of passing a mask to select which bits are swapped, at the cost of having to explicitly compose a sequence of at most LOG2(XLEN) instructions if the full functionality of grev is to be used:


rev1   rd rs1 rs2  # rd = (grevi(rs1, -1)  &~ rs2) | (rs1 & rs2)      implements brev  rd rs1 : rev1 rd rs1 zero
rev2   rd rs1 rs2  # rd = (grevi(rs1, -2)  &~ rs2) | (rs1 & rs2)
rev4   rd rs1 rs2  # rd = (grevi(rs1, -4)  &~ rs2) | (rs1 & rs2)
rev8   rd rs1 rs2  # rd = (grevi(rs1, -8)  &~ rs2) | (rs1 & rs2)      implements bswap rd rs1 : rev8 rd rs1 zero
rev16  rd rs1 rs2  # rd = (grevi(rs1, -16) &~ rs2) | (rs1 & rs2)
rev32  rd rs1 rs2  # rd = (grevi(rs1, -32) &~ rs2) | (rs1 & rs2)      // (RV64, RV128)
rev64  rd rs1 rs2  # rd = (grevi(rs1, -64) &~ rs2) | (rs1 & rs2)      // (RV128)



P9 Generalised shuffle and unshuffle.


Comment: 

I must admit, I neither understand the description nor the rationale for these instructions, other than that they allow arbitrary bit permutations in LOG2(XLEN) ops. Nor do I understand why the zip versions are especially important. Not knowing the shuffle operations is, I am sure, a lack of education on my part, but a clearer explanation would help (a clearer definition? is there an obvious algorithm to implement an arbitrary permutation? Is this optimal? Reference?).


Comment:

From the description in Hacker's Delight, I get that the important cases are transposes of bit matrices. I do understand matrix transpose and e.g. making the even bits the first bits and the odd bits the last bits! You have a matrix transpose later which the following would generalise.


Proposal:

/* Consider a register as a matrix of XLEN >> rs2 rows of length 1 << rs2 and transpose it, i.e. the rs2 least significant bits of the index become the rs2 most significant bits, and the LOG2(XLEN) - rs2 most significant bits become the LOG2(XLEN) - rs2 least significant bits of the index.

   The immediate version has a 3-bit immediate, and a register version may not be needed. */


uint_xlen_t transp(uint_xlen_t rs1, uint_xlen_t rs2)
{
    // for (int i = 0; i < LEN; ++i)
    //     rd[i] = rs1[((i >> rs2) | (i << (LOG2(LEN) - rs2))) % LEN];

    uint_xlen_t rd = 0;
    int i;

    FOREACH(i, 0, LEN)
        rd |= sbget1(rs1, ((i >> rs2) | (i << (LOG2(LEN) - rs2))) % LEN) << i;

    return sext(rd, LEN);
}



For the W version LEN = 32, for the D version LEN = 64 (I am not sure a W or D version is needed, however).



Page 17. 

I don’t understand the notation in table 2.2


Page 17 — 19

I don’t like the definition of SHFL instruction as a C function here and the pattern for RV128 is definitely not so clear if you cannot visualise the numbers as bit patterns (although the picture helps). Can we please have a (possibly slow) more conceptual bitwise definition?


 Page 20


What is meant with “bitwise prefix sum of the bits in the lower byte of a 32 bit word on RV32” here?


Page 21


While clever, where do these (types of) problems come from, that they merit an instruction?



Page 22. 

Section 2.3 bext /bdep


Comment:

I like these instructions because they are clean and clear. However, it would still be good to have applications that e.g. use the x86 BMI instructions. The example seems a little artificial.


Since you already say that they are often not faster than software emulation, and one of the first hits you get for PEXT is that AMD implements them horribly slowly, maybe they should not be in B but still be defined as a Z extension.


The following C functions for BEXT/BDEP are, I think, easier to understand and faster  as software implementations  (but still far from optimal)


uint_xlen_t bext(uint_xlen_t rs1, uint_xlen_t rs2)
{
    uint_xlen_t rd = 0;
    uint_xlen_t dbit, bit;

    // for each bit in the source mask
    for (dbit = 1; (bit = rs2 & ~(rs2 - 1)); dbit <<= 1, rs2 &= ~bit)
        rd |= (bit & rs1) ? dbit : 0;

    return rd;
}

uint_xlen_t bdep(uint_xlen_t rs1, uint_xlen_t rs2)
{
    uint_xlen_t rd = 0;
    uint_xlen_t sbit, bit;

    // for each bit in the destination mask
    for (sbit = 1; (bit = rs2 & ~(rs2 - 1)); sbit <<= 1, rs2 &= ~bit)
        rd |= (sbit & rs1) ? bit : 0;

    return rd;
}
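A small usage example (my example, assuming XLEN == 64): gather the lowest bit of every byte of x into the low byte of the result, as x86 PEXT would:

uint_xlen_t low_bits_of_bytes(uint_xlen_t x)
{
    uint_xlen_t mask = (uint_xlen_t)0x0101010101010101ull;  /* one selected bit per byte */
    return bext(x, mask);                                   /* packs the selected bits into bits 0..7 */
}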



Page 23 CLMUL/CLMULH/CLMULR 


Please add

The CLMUL  and CLMULH/ CLMULR instructions implement polynomial multiplication over the Galois field GF(2) and the closely related multiplication in the higher Galois fields GF(2^n). They are used extensively in cryptography and can be used to compute CRC’s which are defined by computations in Galois fields. 


Bikeshed :

Carryless multiplication is the termininology used by Intel, but that makes the mathematician in me weep. This is polynomial multiplication for the love of $DEITY!  Can we please call these instructions POLMUL[HR] ? 


Page 24

S/GCM/GCD/GCM (greatest common divisor/multiple) of binary polynomials/


What is the prefix XOR operation? 


Page 26


Comment

Isn’t a CLMUL instruction easier and cheaper than a MULT instruction, so may be implemented even if M is not implemented? ( I may be wrong here and obviously a MULT instruction has more general benefit).



Page 26 section 2.6 bmatxor, bmator, bmatflip


I like the binary matrix operations. However, maybe it is enough to just define the more primitive multiplication between a matrix and a vector. We can then also have different (power of two by power of two) sized matrices multiplied with vectors.


Proposal:

orred  rd rs1 3imm  # take the OR reduction of groups of 1 << imm bits
# rd = 0;
# FOREACH(i, 0, XLEN)
#     rd[i >> imm] |= rs1[i];

xorred rd rs1 3imm  # take the XOR reduction of groups of 1 << imm bits
# rd = 0;
# FOREACH(i, 0, XLEN)
#     rd[i >> imm] ^= rs1[i];

repeat rd rs1 3imm  # repeat the first 1 << imm bits
# rd = 0;
# FOREACH(i, 0, XLEN)
#     rd[i] |= rs1[i % (1 << imm)];

transp rd rs1 3imm  # transpose a bit matrix of 1 << imm columns and XLEN >> imm rows
# rd = 0;
# FOREACH(i, 0, XLEN)
#     rd[i] |= rs1[((i << imm) % XLEN) | ((i >> (LOG2(XLEN) - imm)) % XLEN)];


These can then be used to compute a matrix vector multiplication using REPEAT, AND and XORRED. 


uint_xlen_t matvec(uint_xlen_t rs1, uint_xlen_t rs2, unsigned log2_size)
{
    uint_xlen_t v = repeat(rs2, log2_size);
    uint_xlen_t rs1_v = rs1 & v;
    return xorred(rs1_v, log2_size);
}


However, these instructions are also useful for computing parity, creating masks, or (in the case of ORRED) finding the terminating zero byte of a string while reading XLEN/8 bytes at a time.
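As an illustration of the last point, here is a sketch (my code) of a C model of the proposed ORRED together with the zero-byte test it enables:

uint_xlen_t orred(uint_xlen_t rs1, unsigned imm)
{
    uint_xlen_t rd = 0;

    for (int i = 0; i < XLEN; ++i)
        rd |= sbget1(rs1, i) << (i >> imm);   /* OR-reduce groups of 1 << imm bits */
    return rd;
}

int has_zero_byte(uint_xlen_t word)
{
    uint_xlen_t byte_flags = orred(word, 3);               /* one flag bit per byte */
    uint_xlen_t low = ((uint_xlen_t)1 << (XLEN / 8)) - 1;
    return (byte_flags & low) != low;                      /* some byte OR-reduced to 0 */
}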


Page 29. Section 2.7.1


Comment:

Ternary operators get pushback because they are a burden on the micro architecture, and seem like the perfect candidate to be replaced by a fusable instruction sequence that can be recommended to compiler people, provided this is feasible.

  

In this case two such sequences exist. Since

  (rs1 & rs2) | ((~rs1) & rs3) == (rs1 & rs2) ^ ((rs1 ^ (-1)) & rs3)
                               == (rs1 & rs2) ^ (rs1 & rs3) ^ rs3
                               == (rs1 & (rs2 ^ rs3)) ^ rs3

we can write


CMIX rd rs1 rs2 rs3 # rs1&rs2 | (~rs1) & rs3,  assume rd != rs1, rs3

{

XOR rd rs2 rs3

AND rd rd rs1

XOR rd rd rs3

}


and 


CMIX rd rs1 rs2 rs3 # rs1&rs2 | (~rs1) & rs3,  assume rd != rs1, rs2
{
XOR  rd rs2 rs3
ANDN rd rd rs1
XOR  rd rd rs2
}


which gives some freedom for the compiler writer.
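A minimal C check (my code) of the identity and of the first fusable sequence:

uint_xlen_t cmix_ref(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
    return (rs1 & rs2) | (~rs1 & rs3);
}

uint_xlen_t cmix_fused(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
    uint_xlen_t rd = rs2 ^ rs3;   /* XOR rd rs2 rs3 */
    rd &= rs1;                    /* AND rd rd rs1  */
    rd ^= rs3;                    /* XOR rd rd rs3  */
    return rd;                    /* == cmix_ref(rs1, rs2, rs3) */
}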


Page 29 2.7.2


Comment:

We have the same issue with the  ternary operator here of course. 


If we have the AND.NE instruction we can proceed as for the mix operator and define the fusable sequences


CMOV rd rs1 rs2 rs3 # rs1 ? rs2 : rs3,  assume rd != rs1, rs3
{
XOR    rd rs2 rs3
AND.NE rd rs1 rd
XOR    rd rd rs3
}


With a similar alternative sequence with AND.EQ with constraints rd != rs1, rs2


Or we can just define the sequences 


CMOV rd rs1 rs2 rs3 #  rs1 ? rs2 : rs3, assume rd != rs1, rs3 

{

C.MV rd rs2

C.JNEZ rs1 +2

C.MV rd rs3

}


with an alternative sequence using JEQZ.



Note that SiFive is already optimising a JNEZ/JEQZ of +2 bytes followed by a C.MV (or other simple arithmetic instruction) into what is effectively a conditional instruction.



Proposal:

An important special case of the three-operand conditional move (a ? b : c) is the two-operand Elvis operator a ?: c, which is (!!a) ? a : c. (According to Wikipedia, ?: is called the Elvis operator because it was/is an emoticon for Elvis Presley.) Instead of a CMOV, define an "ELVIS" instruction:


elvis rd rs1 rs2 # rd = rs1 ?: rs2 == (!!rs1) ? rs1 : rs2
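In C terms the proposed instruction is simply (my sketch):

uint_xlen_t elvis(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return rs1 ? rs1 : rs2;   /* rs1 ?: rs2 */
}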

 


Page 30


Does changing the order of the registers in the assembler compared to the encoding not create more confusion than it solves? (I genuinely don't know.)



Please add:

For use of the FSL and FSR instruction  see section 3. 


Spelling out the C function as 


uint_xlen_t fsr(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
    uint_xlen_t rd, tmp;

    // test that rs3 % (2*XLEN) >= XLEN
    tmp = rs3 & (1 << LOG2(XLEN));       // ANDI
    if (tmp) {                           // JNEZ
        rd  = rs3 << (rs2 % XLEN);       // SLL
        tmp = rs1 >> ((-rs2) % XLEN);    // SRLNEG
        rd |= tmp;                       // OR
    } else {
        rd  = rs1 << (rs2 % XLEN);       // SLL
        tmp = rs3 >> ((-rs2) % XLEN);    // SRLNEG
        rd |= tmp;                       // OR
    }
    return rd;
}

uint_xlen_t fsl(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
    return fsr(rs1, -rs2, rs3);
}


shows that the immediate versions (where the compiler can do the test) are just three instructions (assuming an SRLNEG instruction exists, otherwise it is four) plus a temporary register clobber, while the non-immediate version is just 5 instructions at runtime and 8 static.


Also, FSL is FSRNEG, i.e. FSL and FSR are essentially equivalent instructions which take up lots of encoding space, so two ternary instructions may be a bit of overkill.



Page 31 Section 2.8 Unsigned address calculation


Comment: I like these instructions. Since adds are so dominant in usage, and in particular in address calculations, in most code they may well turn out to be the most used instructions of the B extension.



What does ADDIWU do? Does it zero or sign extend its 12-bit immediate? (I would suggest zero extending is more useful, as it increases the range of constants that can be defined in one instruction.)



The i^j example is not terribly good, as i^j happens to be already zero extended if i and j are properly sign extended. However, the seemingly innocuous expressions p[i] or p + i for i unsigned already involve a zero extend of i (if p is a char*) or a zero extend and a shift.

  


Please add: 

The SLLIU.W instruction is used to compute addresses of array elements indexed by unsigned integers. 


Comment:

Is there no SLLU.W (aka SLL.WU) instruction? 


Comment:

It might be slightly better to have rs1 zero extended instead of rs2. That leaves open the possibility to have a  slightly more regular C instruction 


C.ZEXT rd  —> ADD.WU rd rd zero



Bikeshed

I think these instructions should be called ADD.WU, since one adds a zero-extended word to an XLEN-sized value to obtain an XLEN-sized value, and SLLI.WU, since one shifts a zero-extended word to get an XLEN-sized value.



Proposal:

I think the scheme could and should be extended to bytes and halves. In C the sum of two shorts is actually an int (rather than a short), so extending the scheme to bytes and halves covers software that uses (for better or for worse) shorts as integers and avoids nasty surprises. So I propose to have, for example:


ADD.H     rd rs1 rs2 # SEXT.H     rd rs2; ADD rd rs1 rd

SUB.HU   rd rs1 rs2 # ZEXT.HU   rd rs2; SUB rd rs1 rd 


Where of course SEXT.H and ZEXT.HU don't actually exist as instructions; in fact they can be made pseudo-instructions using ADD.H and ADD.HU with rs1 == zero.
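A minimal C sketch of the proposed sub-word adds (my proposal), in terms of the sext()/zext() helpers from the introduction:

uint_xlen_t add_h (uint_xlen_t rs1, uint_xlen_t rs2) { return rs1 + (uint_xlen_t)sext((int_xlen_t)rs2, 16); }  /* ADD.H  */
uint_xlen_t sub_hu(uint_xlen_t rs1, uint_xlen_t rs2) { return rs1 - zext(rs2, 16); }                           /* SUB.HU */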


The logic of the W instructions then also gives instructions

ADDW.H  rd rs1 rs2 # ADD.H rd rs1 rs2 ; SEXT.W rd rd 


This should be complemented with

ADDWU.HU rd rs1 rs2 # ADD.HU rd rs1 rs2; ZEXT.WU rd rd

 

Likewise I think there is value in having a pointer + integer instruction, i.e. an instruction that absorbs the shift, like LEA on x86 (one of the most commonly used arithmetic instructions on x86, although that is a bit misleading because it is also the 2-operand add on x86). People have been touting fusion of


SLLI rd rs2 imm

ADD rd rs1 rd


but note that one cannot use the RVC version of SLLI unless rd == rs2 (the register containing the index) and for basic implementations an “already fused instruction” 


 ADDSL<k> rd rs1 rs2 # SLLI rd rs2 <k> ;  ADD rd rs1 rd 


actually seems easier to implement than fusion. 
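A minimal C sketch of the proposed ADDSL<k> (my proposal):

uint_xlen_t addsl(uint_xlen_t rs1, uint_xlen_t rs2, unsigned k)   /* k = 1..7, an immediate */
{
    return rs1 + (rs2 << k);
}

For example, the address p + 8*i of a 64-bit array element becomes a single addsl(p, i, 3).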


Together that gives the following add instructions:


#add      #addw      #addd
addsl1    addwsl1    adddsl1
addsl2    addwsl2    adddsl2
addsl3    addwsl3    adddsl3
addsl4    addwsl4    adddsl4
addsl5    addwsl5    adddsl5
addsl6    addwsl6    adddsl6
addsl7    addwsl7    adddsl7

add.b     addw.b     addd.b
add.h     addw.h     addd.h
add.w                addd.w
add.d

add.bu    addwu.bu   adddu.bu
add.hu    addwu.hu   adddu.hu
add.wu    addwu      adddu.wu
add.du               adddu


With the pattern repeating for sub.


While these are a lot of different ADD’s, the instructions form a densely encoded and regular 

5 bit family of instructions (add/sub, XLEN/sized, (signed/unsiged, 2size) or 3shamt), and they all correspond directly to some form of add between basic datatypes  in the C language. 

For RV32 there are   2*( 7 + 2 + 2 ) == 22   new add/sub instructions. Moreover  are dominant in ALU use, and

The additional SUB versions seems much less important than the add versions and may perhaps be skipped. However since SUB itself can obviously not be skipped, regularity of encoding suggests having them anyway.


Note that an  ADDSL<k> instruction reduces the need for a SLLI.WU instruction: for 0 < imm < 8


SLLI.WU rd rs2 imm

C.ADD rd rs1 rd


is equivalent to (but 2 bytes shorter than)


ZEXT.WU rd rs2

ADDSL<imm> rd rs1 rd


Here I assumed that rs2 gets shifted and zero or sign extended. What is best for a potential C.ADDSL2/3 instruction probably all depends on fusing patterns. 

Also SUB is non symmetric so there is a real difference between  


(V1) SUBSL<k> rd rs1 rs2 # rd = rs1 - (rs2 << k)

and

(V2) SUBSL<k> rd rs1 rs2 # rd = (rs1 << k) - rs2


(But then again, quite possibly all the different variants of SUB are overkill.)



Page 32, Section 2.10 Future compressed instructions


I would take this section out, or just suggest that C.NOT, C.NEG, C.ZEXT.W and C.ZEXT.D may be worthwhile additions. More quantitative data is needed, however, and this just cannot be done for the B extension alone. For example, I think another instruction that may be worthwhile is

 

C.LPC rd —> AUIPC rd 0 # load program counter

to make PC-relative addressing (e.g. for loading constants!) more compact, and this would fit very nicely in the C.LUI rd 0 slot. Personally I doubt that a C.NOT instruction is worth it, and given the choice, a C.NEG instruction is probably more generally useful than C.NOT. (Random case in point: in these random x86 instruction usage statistics https://www.strchr.com/x86_machine_code_statistics NOT does not even make it to the list of mentioned instructions.) C.ZEXT.WU may be worth it if people don't want the WU ADDs.



Page 33


Encodings are a bitch. Perhaps MIN[U] and MAX[U] should be placed with SLT and SLTU, and ADDU.W and SUBU.W (aka ADD.WU and SUB.WU) should be placed with ADD rather than ADDW.



Page 38 Section 2.11 Future 64 bit instructions for bitfield extract and place


I would take this section out.


Page 39 Section 2.12 Micro architectural considerations and macro-op fusion for bit manipulation


Page 40 Fused load-immediate sequences



Comment.

Clifford rightly started a discussion about long immediates (and long JAL!!!); maybe mentioning that this is being discussed would be useful!



Comment

If PACK is changed to have PACK.BU, PACK.HU, PACK.WU and PACK.DU, the PACK.WU version should be used in the examples.



Page 41: 

S/32-bit numbers that have their MSB set/zero extended 32-bit numbers/ 


(Ie. they don’t have there MSB set but  _unset_!)


Comment

If it is decided that  ADDWUI should have a zero extended 12 bit immediate, mention that with the “And finally. 



Page 41 Section 2.12.3 Fused *-not sequences


Is postfix NOT still very important with ANDN, ORN, XNOR, NAND and NOR instructions?



Page 42 Section 2.12.4


I think the extraction sequence  should be 


slli rd rs1 ((-(pos + len)) % XLEN)
srli rd rd  ((-len) % XLEN)

this always makes sense, and does "the right thing" for 0 <= pos + len <= XLEN, pos >= 0, and len >= 0, except for pos == len == 0 :-(. (It also shows the use of SLLNEG and SRLNEG for the non-immediate case.)


I think the insertion sequence for a bit field of length len starting at bit 0 is wrong and should be

slli rd rs1 ((-len) % XLEN)
srli rd rd  ((-(len + pos)) % XLEN)


This also does “the right thing” except for pos == len == 0


Which other sequences with postfix SRLI and SLLI have applications?


Comment:

The elephant in the room is to what extent to standardise whether compilers should recognise recommended fused sequences, which sequences to recommend, and to what extent compiler writers are supposed to respect these sequences.


Page 42 section 2.12.6 Fused ternary operations


I would add this as a general remark in the previous section: all arithmetic ops of the form

alu-op1 rd rs1 rs2
alu-op2 rd rd rs3   (rd != rs3)

(and for that matter

alu-op1 rd rs1 rs2
alu-op2 rd rs3 rd)

are amenable to fusing into an alu-op12 rd rs1 rs2 rs3, and this is the preferred way of dealing with 3-input-register alu-ops if possible. Note the rd != rs3 restriction. This extends to sequences of three or more alu-ops, where even more restrictions may be needed (see e.g. the sequences for CMOV and CMIX). Should such "designed to be fused" sequences be preceded by a hint?



Page 43.


The XNOR macro  is already defined as an instruction with the same semantics. 


Comment:

I proposed that  NAND and NOR  should also be instructions with the same semantics.  See the logical instructions section.



The BFMAK macro definition seems wrong, and I think it should simply be

bfmak rd, len, pos —> sloi rd, zero, len; c.slli rd, rd, pos


However, it seems to me that it is better to simply use

li rd (((1 << len) - 1) << pos)

and teach the compiler/loader about SLOI as one of its "myriad of sequences" to generate a constant.


Bikeshed:

If you want to keep BFMAK, can we call it BFMAKE or LBFI (for load bit field immediate)?



Please use ((-len-pos) % XLEN) instead of (XLEN - len - pos), and similarly for other shift amounts.



Bikeshed:

Maybe beginpos and (one past the) endpos as parameters for the BFEXT[U] macros are better than len, pos. This should make no difference for code generation but avoids the confusion of whether it is len, pos or pos, len.



Page 46: section 3.1.3

Select could be simplified with sbset


Select:

sbset a1, zero,  a1

bdep a0, a1, a0

ctz a0, a0

ret


(Clever trick by the way)


Rank could be done with one less instruction using SLLNEG, if it counts the number of bits up to a1 (but that uses modular arithmetic mod XLEN and counts _all_ the bits if a1 = 0):


Rank:

sllneg a0, a0, a1

pcnt a0, a0

ret




Page 46:

Can one use GREV instead of SHFL here?


One should really not have to use SHFL to pack bytes into a word. With PACK.BU and PACK.HU it becomes the straightforward


pack.bu a0 a0 a1 # pack least significant half word

pack.bu a2 a2 a3 # pack most significant half word

pack.hu a0 a0 a2 # pack word 



Page 47


Bikeshed:

Please write 


x = x >> (t % 64)

 

even though, of course, &63 does the same as %64.



For a good comparison, it would be better to show that the (x-1)  in x |~(x -1) (== x | (-x) ) requires clobbering a register. 


Another 4-instruction implementation not clobbering an extra register (but with a jump) is


uint64_t rfill_clzsro(uint64_t x)
{
    if (x) {                  // JEQZ
        x = clz(x);           // CLZ
        x = sro(0, x % 64);   // SRO
        x = ~x;               // NOT
    }
    return x;
}


Page 48: 3.1.6 Round to next power of two


The number 0 is not a power of two, but this may be conventional use anyway.

 

This example would also work with SLLNEG instead of ROR


Page 50:

Comment:

If I count correctly, with SLLNEG and AND.NE but no FSR and CMOV, the parse_27bit example would be equally long, because AND.NE is CMOV with rs3 == zero and the FSR expands into 3 instructions, one of which is the SLL below the FSR and is already computed (I think).



Page 51: Using butterfly operations.


At this stage the reader might need to be reminded of the argument conventions of CMIX, i.e. a comment like


cmix a0, a1, a2, a0 # a0 = (a2 & a1) | (a0 &~ a1)



Page 52

The butterfly function is undefined and seems to depend on parameters other than just the stage number. 


Page 53 Using baseline Networks

Again the butterfly(0) seems underspecified, and it is not clear how this section relates to the instruction set.


Page 56. 3.4.3 Explanation


Working this text into the corresponding instruction definition sections would be very helpful. For example you use the notation introduced in this section without explanation in the table of standard operations in the GZIP section. 


IMO the explanation of what GZIP does is still suboptimal however. 


Page 60


I think you mean remainder instead of the (German) rest.


Page 61

These decode examples could probably use a C code version as a companion. How did you construct them?


As an OT side remark: to use an instruction decoder to emulate an instruction in M mode, even the obvious improvements provided by the B extension are probably not enough to make software emulation practical if the instruction has to be parsed from the beginning. It would seem much easier if, upon a trap, the processor parsed the instruction format (for which it has lots of hardware) and provided CSRs with


0 trapping address (this the CPU does already)
1 instruction opcode including the full function fields
2 TRAP_IMPL address of an implementation function for this opcode, or of an opcode_unimplemented function
3 TRAP_RS1 content of rs1
4 TRAP_RS2 content of rs2 or the immediate
5 TRAP_RS3 content of rs3, if it exists



The trap could then do something like

CSRRW a0 TRAP_RS1 a0   # exchange a0 and TRAP_RS1; saves a0
CSRRW a1 TRAP_RS2 a1   # exchange a1 and TRAP_RS2; saves a1
CSRRW ra TRAP_IMPL ra  # exchange ra and TRAP_IMPL; saves ra
JALR ra ra
CSRRW ra TRAP_IMPL ra  # exchange ra and TRAP_IMPL; restores ra
CSRRW a1 TRAP_RS2 a1   # exchange a1 and TRAP_RS2; restores a1
CSRRW a0 TRAP_RS1 a0   # exchange a0 and TRAP_RS1; restores a0
MRET


Obviously this is naive and more song and dance is needed at the trap boundary, but the point is not to do the parsing of the instruction "by hand". Also, this is obviously outside the scope of the B extension.


==== END OF REVIEW====


xBitmanip V0.9 review.rtf

lkcl

Sep 2, 2019, 5:56:42 PM9/2/19
to RISC-V ISA Dev
On Tuesday, September 3, 2019 at 4:55:16 AM UTC+8, Rogier Brussee wrote:


> Is nand and nor not also worth it (the RVV mask instructions seem to have them for example) ? 

SV has no opcodes. Therefore, BitManip having the exact same RVV "Mask" operations (except as bit manip not zero/nonzero elements) is absolutely critical.

So, yes, it is worthwhile and necessary, as are the other RVV "Mask" operations (converted to bitmanip).


Rogier, there is an enormous amount of very valuable comments here. Going through them on a discussion forum is easily going to be overwhelming.

I would recommend an editable online document (wiki) where the sections not approved/actioned can simply be deleted over time, just like a TODO list.

L.

lkcl

Sep 2, 2019, 7:36:57 PM9/2/19
to RISC-V ISA Dev
OK, so I've read everything and must re-read it several times. First impressions:

Whilst ROR etc seem like a good idea for cryptographic applications, it was when I saw GF(2) MUL as well that I thought, "hang on a minute, the op count is getting alarmingly high. GFMUL is not exactly BitManip, it's a specialised function".

I would strongly suggest that anything not general purpose (with at least two common uses in completely different applications fields) NOT go into BitManip but into their own extension. Not even as a Zbitmanip*: out of BitManip entirely.

Otherwise BitManip becomes something of a "dumping ground" aka kitchen sink.

That risks increasing the workload for BitManip and, not only that, risks overloading OP32 and RVC, making further extensions extremely challenging.

Yes there is ISAMUX/ISANS however it is not to be abused, being a safety valve and/or applications specific enabler, not a general purpose OP-extender.


Also, ternary operators - three src operands and 1 dest - are incredibly expensive. The only way to encode four registers (4x5=20bits) is to take an ENTIRE major OP32.

I think there is only one 4 reg instruction in RV : FMAC.

RISC is not about putting in as much as possible, it is about doing as little as possible in the ISA and still being incredibly efficient and effective.

Ternary operators completely break that rule. Red flag.

The long immediates idea: separate discussion, reminder that for completeness a proper comparison is needed (using the existing instruction lengths)

Also, in the Mill Architecture it has "compressed" immediates. 0 is encoded in a single bit. -1 may be encoded in 9 bits.

My feeling is that any comparison of long immediate proposals should also include compression algorithms (simple ones, of the order of Huffman encoding).

For another thread, that one.

L.

Jacob Lifshay

Sep 2, 2019, 9:12:37 PM9/2/19
to Rogier Brussee, RISC-V ISA Dev
On Mon, Sep 2, 2019, 13:55 Rogier Brussee <rogier....@gmail.com> wrote:


* If the C code definitions involving shifts I prefer to write  something like rs1 << (rs2 % LEN) because the shifts in RV are mod LEN. wherever a shift is made in C code, to emphasise that  shifts in RV are mod XLEN, 32 (for W) or 64 (for D). This also makes it slightly clearer that it makes sense to shift over negative numbers. 

As a result I avoid definitions like  shamt = rs2 & (LEN - 1)  which, of course, is equivalent but conceptually less clear.

For the most common C implementations, a % b for negative a returns negative integers (see https://gcc.godbolt.org/z/r72dnQ ) which are undefined behavior as a shift amount according to most compilers (and the C standard) (for example, if the compiler compiled directly to 32-bit ARM assembly without checking for undefined behavior, 1 << (-1 % 16) returns 0 on ARM, rather than 0x10000, since -1 % 16 == -1 and shifts on ARM reduce the shift amount mod 0x100 (rather than mod 32) producing a shift by 255). % is generally called the remainder operation instead of mod. Using the mask allows always getting non-negative results, avoiding shifting by a negative number. maybe it would be worthwhile to define an auxillary function mod to calculate the modulo:

int mod(int a, int b) // replace types as necessary
{
    int rem = a % b;
    return (rem + b) % b;
}

it would be used in the shift operations as follows:

unsigned shift_demo(unsigned a, int b)
{
    return (a >> mod(b, LEN)) | (a << mod(-b, LEN));
}

Jacob Lifshay

Jacob Lifshay

Sep 2, 2019, 9:24:18 PM9/2/19
to Luke Kenneth Casson Leighton, RISC-V ISA Dev
On Mon, Sep 2, 2019, 16:37 lkcl <luke.l...@gmail.com> wrote:
Ok so read everything, must re-read several times. First impressions:

Whilst ROR etc seem like a good idea for cryptographic applications, it was when I saw GF(2) MUL as well that I thought, "hang on a minute, the op count is getting alarmingly high. GFMUL is not exactly BitManip, it's a specialised function".

GF(2) MUL is the same as carry-less multiply, as proposed in http://bugs.libre-riscv.org/show_bug.cgi?id=134

as outlined there, carry-less multiply is useful for cryptography, CRCs (networking, etc), implementing LFSRs in software, some hash functions (for hash-tables), etc.

Because a carry-less multiply instruction improves speed by a large amount for its most common applications, which are widely used, I would argue that carry-less multiply is almost essential for any server processor, and it would be foolish to leave it out of bitmanip (but not required for all implementations, a small microcontroller probably doesn't need it).

Jacob Lifshay

Rogier Brussee

Sep 3, 2019, 2:06:40 AM9/3/19
to RISC-V ISA Dev, rogier....@gmail.com


On Tuesday, September 3, 2019 at 03:12:37 UTC+2, Jacob Lifshay wrote:
On Mon, Sep 2, 2019, 13:55 Rogier Brussee <rogier...@gmail.com> wrote:


* If the C code definitions involving shifts I prefer to write  something like rs1 << (rs2 % LEN) because the shifts in RV are mod LEN. wherever a shift is made in C code, to emphasise that  shifts in RV are mod XLEN, 32 (for W) or 64 (for D). This also makes it slightly clearer that it makes sense to shift over negative numbers. 

As a result I avoid definitions like  shamt = rs2 & (LEN - 1)  which, of course, is equivalent but conceptually less clear.

For the most common C implementations, a % b for negative a returns negative integers (see https://gcc.godbolt.org/z/r72dnQ ) which are undefined behavior as a shift amount according to most compilers (and the C standard) (for example, if the compiler compiled directly to 32-bit ARM assembly without checking for undefined behavior, 1 << (-1 % 16) returns 0 on ARM, rather than 0x10000, since -1 % 16 == -1 and shifts on ARM reduce the shift amount mod 0x100 (rather than mod 32) producing a shift by 255). % is generally called the remainder operation instead of mod. Using the mask allows always getting non-negative results, avoiding shifting by a negative number. maybe it would be worthwhile to define an auxillary function mod to calculate the modulo:


The equivalence of "%N" and "&(N-1)" holds  for  _unsigned_ integers uint_xlen_t  which do modular artitmatic mod 2^XLEN and N a power of two  (such as XLEN = 2^5 or 2^6).  Negative numbers like -8 still make perfect sense of course. 

Jacob Lifshay

Sep 3, 2019, 2:16:59 AM9/3/19
to Rogier Brussee, RISC-V ISA Dev
On Mon, Sep 2, 2019, 23:06 Rogier Brussee <rogier....@gmail.com> wrote:


On Tuesday, September 3, 2019 at 03:12:37 UTC+2, Jacob Lifshay wrote:
On Mon, Sep 2, 2019, 13:55 Rogier Brussee <rogier...@gmail.com> wrote:


* If the C code definitions involving shifts I prefer to write  something like rs1 << (rs2 % LEN) because the shifts in RV are mod LEN. wherever a shift is made in C code, to emphasise that  shifts in RV are mod XLEN, 32 (for W) or 64 (for D). This also makes it slightly clearer that it makes sense to shift over negative numbers. 

As a result I avoid definitions like  shamt = rs2 & (LEN - 1)  which, of course, is equivalent but conceptually less clear.

For the most common C implementations, a % b for negative a returns negative integers (see https://gcc.godbolt.org/z/r72dnQ ) which are undefined behavior as a shift amount according to most compilers (and the C standard) (for example, if the compiler compiled directly to 32-bit ARM assembly without checking for undefined behavior, 1 << (-1 % 16) returns 0 on ARM, rather than 0x10000, since -1 % 16 == -1 and shifts on ARM reduce the shift amount mod 0x100 (rather than mod 32) producing a shift by 255). % is generally called the remainder operation instead of mod. Using the mask allows always getting non-negative results, avoiding shifting by a negative number. maybe it would be worthwhile to define an auxillary function mod to calculate the modulo:


The equivalence of "%N" and "&(N-1)" holds  for  _unsigned_ integers uint_xlen_t  which do modular artitmatic mod 2^XLEN and N a power of two  (such as XLEN = 2^5 or 2^6).  Negative numbers like -8 still make perfect sense of course. 

missed that the types are unsigned, in which case you are correct. sorry for the confusion.
 
int mod(int a, int b) // replace types as necessary
{
    int rem = a % b;
    return (rem + b) % b;
}

it would be used in the shift operations as follows:

unsigned shift_demo(unsigned a, int b)
{
    return (a >> mod(b, LEN)) | (a << mod(-b, LEN));
}

Jacob Lifshay


Clifford Wolf

Sep 4, 2019, 8:36:22 AM9/4/19
to Rogier Brussee, RISC-V ISA Dev
Hi,
Review of Xbitmanip V0.9 (June 10)

I’m assuming that means RISC-V Bitmanip v0.90. (That’s the June 10 release.)


Just a few days ago I released RISC-V Bitmanip v0.91. (Note that it’s not called XBitmanip anymore ever since we became an official RISC-V task group again.)


You can always find the latest version of the bitmanip spec (or something very close to the latest version) at

https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf


Your review is over 20 pages long, so I’m keeping my replies to what I think are the most important points.

It took me over 5 hours to work through your entire review and write the reply below.

So I hope you can understand that I don’t have the time to reply to everything you wrote.


Please open an issue at https://github.com/riscv/riscv-bitmanip/issues/new if you feel like any one suggestion I skipped or dismissed in my reply should see more debate.


I’m skipping a lot of stuff that’s imo bikeshedding (including a lot of stuff that you have not marked as bikeshedding). I’ve also skipped a lot of proposals to re-do large portions of riscv-bitmanip as it’s about a year too late for that.


For many sections it is not quite clear what the W versions (for RV64) or D versions (for RV128) are supposed to do.  I will try to indicate how I understand it works.

The general semantic and reasoning behind *W instructions is explained in the base ISA spec. They always just perform the RV32 operation on RV64. I don’t see it in the scope of the Bitmanip spec to explain *W instructions as a concept.


It would be useful if  a rationale for the use of an instruction would be given and quantitative data on use were available.

We are working on that. Right now we are building the tools. Can't benchmark anything if you can't generate the code.


Many sections contain C code as a way to define what the instruction is supposed to be doing.  I like this because it is usually much easier to follow than words. However, I think there is value in making the definition as conceptually  simple as possible even if it is (much) slower, and I will suggest some. Can we move faster but less intuitive implementations to an appendix?

These ARE the simple versions. The fast versions can be found in the “Fast C reference implementations” section within the “Reference Implementations” chapter.


* If the order of a loop is not important I use this macro
/* for loop that does not care about order: fake it  by running backwards */
#define FOREACH( I , BEGIN, END) for( I =  (END) - 1 ; I >= (BEGIN) ; —I)

You can’t be serious. I think we just have very different ideas of what “simple” C code is. The reference implementations won’t be changed to use stuff like that.


Comment: It seems to me that the B extension has moved to “integer instructions that extend or optimise the basic I instructions. I think this is a good thing, and I would argue that the B extension should position as  slightly cheaper than the M extension, [...]

That’s what the Zbb extension is.


For the W versions set LEN = 32.

I’m not adding that to the description of every instruction. I’ve now added one paragraph to the introduction that reiterates what *W instructions do.


Please add
Clz and ctz are used for finding the first elements of a bit map. In particular it is used for the implementation of malloc. Clz is also used for soft float

Tbh I don’t think someone reading this document would need to be educated about the usefulness of clz/ctz, but I’ve added a few more applications to the clz/ctz section now.


Comment: I like that all basic logical operations have a “n” version. 
Is nand and nor not also worth it (the RVV mask instructions seem to have them for example) ?

Because of instructions like SUB there is an existing input inverter that can be used for ANDN, ORN, and XNOR, controlled by the same instruction bit that is used in those new instructions. The cost of adding ANDN, ORN, and XNOR is virtually zero. The cost of adding NAND/NOR would not be.


please add
The pack instruction can be used to pack “struct” data that fits in a XLEN sized register and is passed in registers.

I’ve added language to that effect.


Proposal:
I really like the pack instructions. However I would propose to have 4 and XLEN independent pack instructions:
pack.bu  rd rs1 rs2  # rd = zext(rs1, 8)   | ( zext(rs2, 8) << 8 )
pack.hu  rd rs1 rs2  # rd = zext(rs1, 16) | (zext(rs2, 16) << 16)
pack.wu rd rs1 rs2  # rd = zext(rs1, 32) |  (zext(rs2, 32) << 32). //(RV64)
pack.du. rd rs1 rs2  # rd = zext(rs1,64) | (zext(rs2, 64) << 64)  //(RV64, 128)

The only really new instruction in that list is pack.bu, and it replaces only two instructions (pack+shuffle).


Your pack.hu is just packw (RV64) or pack (RV32), except that packw fills the upper 32-bit with sign extension.


Your pack.wu is just pack on RV64 (or packd on RV128).


Your pack.du presumably is just packw on RV128.


Page 7.
Comment.
One should not need to use shfl instructions to pack bytes. See proposal for pack.bu.

One should not need a pack.bu instruction to pack bytes. See shfl instructions. ;)


The ZEXT pseudo instruction could also use ADDWU. Also this is (as far as I can see) the only mention of RV128. It should perhaps be mentioned everywhere (preferably) or nowhere.

There’s nothing to gain by using ADDWU instead of PACK[D] for ZEXT.W. Using PACK for all ZEXT where it’s possible is more consistent.


Any mention of RV128 in the bitmanip spec is just commentary considering that there is not even a ratified RV128 base ISA spec.


When I mention RV128 it is only show that we have considered implications for RV128. (For example wrt encoding of shift-like instructions.)


Page 7 section 2.1.5 Min/Max
Proposal:
I would propose to also add SGE and SGEU. As is done in RI5CY/PULP, this would be nicely placed with the SLT and SLTU instructions (i.e. have the same func3 bits as SLT and SLTU see proposal in  section 2.1.3 ) making each a 2 bit family 

That would replace a sequence of only two easily fusable instructions. (SLT/SLTU + XORI)


A while back I disassembled the entire debian base image to figure out what percentage of instructions would be sequences matching ==, !=, ||, or &&, to estimate if it would be worth adding instructions for those. This was the result:

 

total 10489586

seteq 634 (0.0060%)

setne 1856 (0.0177%)

lor 353 (0.0034%)

lnor 243 (0.0023%)


I have not done a similar experiment for SGE or SGEU.


You are welcome to perform a similar experiment for SGE/SGEU and propose the instructions via github issue if it turns out they’d see a lot of usage. Here are my scripts for reference:


http://svn.clifford.at/handicraft/2019/rvlogicops/


Page 8 
Taking the absolute value of an integer is a generally useful operation and I think it is not needed  to mention SAT solvers.

I’ve made the language a bit more generic. But note that even though it’s a generally useful operation, there are few applications that need it so frequently that it has measurable performance impacts. (That is of course in big part because calculating abs() isn’t very expensive even without min/max.) Therefore I think it does make sense to mention SAT solvers as a concrete application.
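For what it's worth, the two abs() sequences being compared look roughly like this (a sketch; the usual caveat about the most negative value applies):

int_xlen_t abs_with_minmax(int_xlen_t x)
{
	return (x > -x) ? x : -x;          /* neg + max */
}

int_xlen_t abs_without_minmax(int_xlen_t x)
{
	int_xlen_t m = x >> (XLEN - 1);    /* srai: all zeros or all ones */
	return (x ^ m) - m;                /* xor + sub */
}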


Page 8. Section 2.1.6 sbset/sbclr/sbinv/sbext
Please add:
These instructions efficiently support bitfields, and avoid costly constant constructions  and nasty surprises especially if the bitfields are not in low bits.

In the context of bit manipulation instructions, the term “bitfield” is commonly used in the context of instructions that manipulate contiguous regions of bits. (See for example the ARM instructions BFC, BFI, SBFX, etc.)


These are single bit instructions, not bitfield instructions.


I think the s prefix should be reserved for instructions returning booleans (0/1), which means only SBEXT  is properly named. I would suggest BSET1, BCLR1, BINV1, and SBEXT1 or SBGET1 (but see below)


You don’t even say why you feel that the “S” prefix should be reserved for such instructions..


Also, too late: There’s already SB SH SW SLL[I] SRL[I] SRA[I] and the SEXT.W pseudo-instruction in the base ISA.


What is meant with “bitwise prefix sum of the bits in the lower byte of a 32 bit word on RV32” here?

https://en.wikipedia.org/wiki/Prefix_sum

https://en.wikipedia.org/wiki/Bitwise_operation


The example code in question is described as following in the spec:


“For example, the following code calculates the bitwise prefix sum of the bits

in the lower byte of a 32 bit word on RV32. The final prefix sum is stored in the 8 nibbles of the output word.”


I.e. when A[i] is bit i of the input, and B[i] is nibble i of the output, then


B[0] = A[0]

B[1] = A[0] + A[1]

B[2] = A[0] + A[1] + A[2]

etc.
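Written out as a minimal C sketch of that RV32 example (names are illustrative, not from the spec):

uint32_t bitwise_prefix_sum8(uint32_t x)
{
	uint32_t out = 0, sum = 0;
	for (int i = 0; i < 8; i++) {
		sum += (x >> i) & 1;       /* number of set bits among bits 0..i */
		out |= sum << (4 * i);     /* store it in nibble i of the result */
	}
	return out;
}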


Since you already say that they often are not faster than software emulation, and one of the first hits you get for PEXT is that AMD implements them horribly slow, maybe they should not be in B but still be defined as Z extension.

Did you even read the introduction section of the spec chapter?


The CLMUL  and CLMULH/ CLMULR instructions implement polynomial multiplication over the Galois field GF(2) and the closely related multiplication in the higher Galois fields GF(2^n). They are used extensively in cryptography and can be used to compute CRC’s which are defined by computations in Galois fields. 

They don’t implement multiplication over GF(2^n). That would be carry-less product followed by reduction. So implementing GF(2^n) products is an _application_ of clmul, but it is not what clmul does. Using Barrett reduction a GF(2^n) product can be implemented with 4 clmul operations: Two clmul for the polynomial GF(2) multiplication and two more for the reduction.


I’ve added the sentence “Carry-less multiplication is equivalent to multiplication in the polynomial ring over GF(2)”.
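For reference, a bit-by-bit sketch of the carry-less product itself, in the style of the reference code:

uint_xlen_t clmul(uint_xlen_t rs1, uint_xlen_t rs2)
{
	uint_xlen_t x = 0;
	for (int i = 0; i < XLEN; i++)
		if ((rs2 >> i) & 1)
			x ^= rs1 << i;         /* XOR instead of add: no carries */
	return x;
}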


Page 24
S/GCM/GCD/GCM (greatest common divisor/multiple) of binary polynomials/

GCM means Galois/Counter Mode here.


I’ve replaced “CRC and GCM” with “Cyclic Redundancy Check (CRC) and Galois/Counter Mode (GCM)”.


What is the prefix XOR operation? 

for (i=0..XLEN-1): out[i] = ^in[i:0]


where X[i] is bit number i, X[i:0] is the i+1 LSB bits and ^X is XOR reduce.


Like Prefix-sum but with xor instead of addition.
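As a minimal bit-by-bit sketch (illustrative only):

uint_xlen_t prefix_xor(uint_xlen_t rs1)
{
	uint_xlen_t parity = 0, out = 0;
	for (int i = 0; i < XLEN; i++) {
		parity ^= (rs1 >> i) & 1;  /* XOR of bits 0..i */
		out |= parity << i;
	}
	return out;
}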


Isn’t a CLMUL instruction easier and cheaper than a MULT instruction, so may be implemented even if M is not implemented? ( I may be wrong here and obviously a MULT instruction has more general benefit).

It is cheaper but not so cheap that it’s likely that an implementation would bother with fast CLMUL if they don’t even bother with fast MUL.


I like the binary matrix operations. However, maybe it is enough to just define the more primitive multiplication between a matrix and a vector. We can then also have different (power of two by power of two) sized matrices multiplied with vectors.

The HW cost for bmat[x]or is not as big as one might expect.


My reference single-cycle implementation of bmat[x]or mapped to a generic 4-LUT FPGA architecture is 706 LUTs in size. That’s less than the size of five 64-bit adders.


Being able to for example perform an arbitrary permutation of bytes in a word in a single instruction is interesting. Doing it in 8 instructions doesn’t really add much.


Ternary operators have pushback because they are a burden on the micro architecture

That’s why they are in a separate Z extension. 


Does changing the order of the registers in the assembler compared to the encoding not create more confusion than it solves (I genuinely don’t know).

Feel free to discuss at https://github.com/riscv/riscv-bitmanip/issues/11


Apparently it would be “insane” not to do it. Let me know when you two have agreed on whether changing the order is “confusing” or not changing it would be “insane”..


shows that the immediate versions (where the compiler can do the test) are just three instructions

and we had a very long discussion within the task group and ultimately decided that it would be worthwhile to include them anyway. I agree with that decision.


What does ADDIWU do? Does it zero or sign extend its 12-bit immediate? (I would suggest zero extending is more useful as it increases the range of constants that can be defined in one instruction.)

The immediate is sign-extended, exactly like the immediate to ADDIW. I’ve now added a sentence clarifying that.


The i^j example is not terribly good as i^j happens to be  already zero extended, if i and j are properly sign extended.

No. i^j is sign extended if i and j are sign extended. (That’s why there is no XORW instruction.)


For example: 0x80000000 ^ 0x00000000


With sign extended inputs:

0xFFFFFFFF80000000 ^ 0x0000000000000000 = 0xFFFFFFFF80000000


The SLLIU.W instruction is used to compute addresses of array elements indexed by unsigned integers.

All five “unsigned address calculation instructions” are.


Comment:
Is there no SLLU.W (aka SLL.WU) instruction? 

What would we need that for?


It might be slightly better to have rs1 zero extended instead of rs2. That leaves open the possibility to have a  slightly more regular C instruction 

That would zero extend the wrong input for subu.w if the idea is to use it for the address calculation use-case.
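For the address-calculation use case, the intended pattern is roughly the following (a sketch, assuming 8-byte elements and an unsigned 32-bit index held in a 64-bit register):

uint64_t element_addr(uint64_t base, uint64_t idx_reg)
{
	uint64_t i = idx_reg & 0xffffffffu;   /* zero-extend the unsigned 32-bit index */
	return base + (i << 3);               /* slliu.w t0, idx, 3 ; add t0, base, t0 */
}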


I think the scheme could and should be extended to bytes and halves. [..]
Together that gives the following add  instructions : 
#add #addw #addd
addsl1 addwsl1 adddsl1
addsl2 addwsl2 adddsl2
addsl3 addwsl3 adddsl3
addsl4 addwsl4 adddsl4
addsl5 addwsl5 adddsl5
addsl6 addwsl6 adddsl6
addsl7 addwsl7 adddsl7
add.b      addw.b addd.b            
add.h addw.h addd.h
add.w addd.w
add.d
add.bu addwu.bu adddu.bu
add.hu addwu.hu adddu.hu
add.wu addwu adddu.wu
add.du adddu
With the pattern repeating for sub.

If I counted correctly that would mean adding 82 new R-type instructions. That’s over half of a minor opcode. You can’t be serious.


Page 32, Section 2.10 Future compressed instructions
I would take this section out, or just suggest that C.NOT, C.NEG and C.ZEXT.W and C.ZEXT.D may be worthwhile additions.  More quantitative data is needed however, and this just cannot be done just for the B extension. For example I think another instruction that may be worthwhile is

Did you read the part where it says this is a recommendation for a future revision of “C”, not part of the “B” extension?

 

Page 41: 
S/32-bit numbers that have their MSB set/zero extended 32-bit numbers/ 
(I.e. they don’t have their MSB set but _unset_!)

Here is a 32-bit number with its MSB set: 0x82341234


Sign extended: 0xFFFFFFFF82341234


Zero extended: 0x0000000082341234


You can’t load the zero extended version on RV64 with LUI+ADDI.

But you can load it with LUI+ADDIWU.


Page 42 Section 2.12.4
I think the extraction sequence  should be 
slli rd rs1 (-(pos + len)) % XLEN
srli rd rd  (-len) % XLEN

I think taking the remainder of a negative number is a bad idea in many contexts, but especially in something like a specification, considering that in languages like C the behavior is implementation-defined.


this always makes sense, and does “the right thing” for 0 <= pos + len <= XLEN, pos >= 0, and len >= 0 except pos == len == 0 :-(.

I’m not even sure what you imply “the right thing” is for len=0, and pos!=0.


I think the insertion sequence of a bit field of length len starting at bit 0 is wrong and should be 
slli rd rs1 (-len) % XLEN
srli rd rd  (-(len + pos)) % XLEN

I fixed the issue, but kept my form that does not require modulo of negative numbers.


This also does “the right thing” except for pos == len == 0

Again, not seeing how leaving the register unchanged (left shift by zero, right shift by zero) is “the right thing”.


The elephant in the room is to what extent to standardise if compilers should recognise recommended fused sequences, what sequences to recommend,  and to what extent compiler writers are supposed to respect these sequences. 

There is no requirement for a compiler or for a processor to implement any of those.


The BFMAK  macro definitions seems wrongly defined and I think it should be simply 
bfmak rd, len, pos —> sloi rd, zero, len,  c.slli rd rd pos 

That’s just an equivalent definition. The difference is that your definition does not match the *-srli/*-srai fusion pattern, but the definition in the spec does.


And the description for that list of three pseudo-ops literally says “bitfield pseudo-ops for sr[la]i postfix fusion”, so I think the sequences should actually match that fusion pattern.


Please use ((-len-pos)%XLEN) instead of (XLEN - len -pos) and similarly for other shift amounts.  

I’m not going to use C-like expressions that rely on implementation defined C behavior in a spec document.


Page 46: section 3.1.3
Select could be simplified with sbset

That’s correct. That example was written before we added SB* instructions. I’ve changed it now accordingly.


Page 46:
Can one use GREV instead of SHFL here?

I don’t think so. I wouldn’t know how.


One should really not have to use SHFL to pack bytes in a word. With PACK.BU and PACK.HU it becomes the straightforward

You are saying that as if it would be obvious what’s bad about using shuffle for that. It’s not obvious to me. Why do you think we need to invent a new instructions to do something that can already be done with SHFL?


Page 48: 3.1.6 Round to next power of two
The number 0 is not a power of two, but this may be conventional use anyway.

That’s why the text contains the clause “i.e. equivialent to rfill(x-1)+1.”


Page 52
The butterfly function is undefined and seems to depend on parameters other than just the stage number.

The first paragraph in that section defines what a stage-N butterfly operation is.


Page 60
I think you mean remainder instead of the (German) rest.

Oops. :D Fixed.


Page 61
These decode examples could probably use a C code version as a companion.  How did you construct them ? 

I created them with the use of an SMT solver.


Regards,

- Clifford


Rogier Brussee

Sep 4, 2019, 3:37:37 PM
to RISC-V ISA Dev, rogier....@gmail.com


On Wednesday, September 4, 2019 at 14:36:22 UTC+2, clifford wrote:
Hi,
Review of Xbitmanip V0.9 (June 10)

I’m assuming that means RISC-V Bitmanip v0.90. (That’s the June 10 release.)


Yes. 

Just a few days ago I released RISC-V Bitmanip v0.91. (Note that it’s not called XBitmanip anymore ever since we became an official RISC-V task group again.)


You can always find the latest version of the bitmanip spec (or something very close to the latest version) at

https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf



Nice!
 

Your review is over 20 pages long, so I’m keeping my replies to what I think are the most important points.

It took me over 5 hours to work through your entire review and write the reply below.

So I hope you can understand that I don’t have the time to reply to everything you wrote.


Of course. The review took longer than I thought too :-(
 


Please open an issue at https://github.com/riscv/riscv-bitmanip/issues/new if you feel like any one suggestion I skipped or dismissed in my reply should see more debate.


I’m skipping a lot of stuff that’s imo bikeshedding (including a lot of stuff that you have not marked as bikeshedding). I’ve also skipped a lot of proposals to re-do large portions of riscv-bitmanip as it’s about a year too late for that.


For many sections it is not quite clear what the W versions (for RV64) or D versions (for RV128) are supposed to do.  I will try to indicate how I understand it works.

The general semantic and reasoning behind *W instructions is explained in the base ISA spec. They always just perform the RV32 operation on RV64. I don’t see it in the scope of the Bitmanip spec to explain *W instructions as a concept.


It is not the concept of W,  which is perfectly clear, it is that you mention XLEN specifically in the C code (which should change to 32 for W) and uint_xlen_t (which should just stay what it is) and standards need to be excruciatingly explicit.


It would be useful if  a rationale for the use of an instruction would be given and quantitative data on use were available.

We are working on that. Right now we are building the tools. Can’t benchmark anything if you can’t generate the code.


Great! 
 



Many sections contain C code as a way to define what the instruction is supposed to be doing.  I like this because it is usually much easier to follow than words. However, I think there is value in making the definition as conceptually  simple as possible even if it is (much) slower, and I will suggest some. Can we move faster but less intuitive implementations to an appendix?

These ARE the simple versions. The fast versions can be found in the “Fast C reference implementations” section within the “Reference Implementations” chapter.


* If the order of a loop is not important I use this macro
/* for loop that does not care about order: fake it  by running backwards */
#define FOREACH( I , BEGIN, END) for( I =  (END) - 1 ; I >= (BEGIN) ; —I)

You can’t be serious. I think we just have very different ideas of what “simple” C code is. The reference implementations won’t be changed to use stuff like that.


The rationale is not to save a few keystrokes but to indicate where order matters and where not, while still being runnable. I can see your point though.  
 

Comment: It seems to me that the B extension has moved to "integer instructions that extend or optimise the basic I instructions". I think this is a good thing, and I would argue that the B extension should position itself as slightly cheaper than the M extension, [...]

That’s what the Zbb extension is.


Good point. Perhaps it is good to explicitly write that down. 
 

For the W versions set LEN = 32.

I’m not adding that to the description of every instruction. I’ve now added one paragraph to the introduction that reiterates what *W instructions do.


OK.
 

Please add
Clz and ctz are used for finding the first elements of a bit map. In particular they are used in the implementation of malloc. Clz is also used for soft float.

Tbh I don’t think someone reading this document would need to be educated about the usefulness of clz/ctz, but I’ve added a few more applications to the clz/ctz section now.



I think it is useful background info for all instructions to indicate important use cases.  
 
Comment: I like that all basic logical operations have a “n” version. 
Is nand and nor not also worth it (the RVV mask instructions seem to have them for example) ?

Because of instructions like SUB there is an existing input inverter that can be used for ANDN, ORN, and XNOR, controlled by the same instruction bit that is used in those new instructions. The cost of adding ANDN, ORN, and XNOR is virtually zero. The cost of adding NAND/NOR would not be.

please add
The pack instruction can be used to pack "struct" data that fits in an XLEN-sized register and is passed in registers.

I’ve added language to that effect.



OK !

Proposal:
I really like the pack instructions. However I would propose to have 4 and XLEN independent pack instructions:
pack.bu rd rs1 rs2   # rd = zext(rs1, 8)  | (zext(rs2, 8) << 8)
pack.hu rd rs1 rs2   # rd = zext(rs1, 16) | (zext(rs2, 16) << 16)
pack.wu rd rs1 rs2   # rd = zext(rs1, 32) | (zext(rs2, 32) << 32)   //(RV64)
pack.du rd rs1 rs2   # rd = zext(rs1, 64) | (zext(rs2, 64) << 64)   //(RV64, 128)

The only really new instruction in that list is pack.bu, and it replaces only two instructions (pack+shuffle).



The alternative view is that it organises the PACK, PACKW and PACKD instructions in a different way and complements them with PACK.BU, which is already useful for RV32, making it a 2 bit family.

In any case it seems more primitive than shuffle. Shuffle is not in Zbb according to the diagram on page 3 in v0.91; PACK is.

 I admit that ANDI is perfectly good for ZEXT.B though.

 

Your pack.hu is just packw (RV64) or pack (RV32), except that packw fills the upper 32-bit with sign extension.



Right.  It zero extends because that is needed for the use case of building  

struct foo{
       short low, high;
}

as a structure to pass in an XLEN-sized register.

Your pack.wu is just pack on RV64 (or packd on RV128).


Right  (except packd would sign extend)

Your pack.du presumably is just packw on RV128.



No, it would be PACK on RV128.
 
Page 7.
Comment.
One should not need to use shfl instructions to pack bytes. See proposal for pack.bu.

One should not need a pack.bu instruction to pack bytes. See shfl instructions. ;)


The ZEXT pseudo instruction could also use ADDWU. Also this is (as far as I can see) the only mention of RV128. It should perhaps be mentioned everywhere (preferably) or nowhere.

There’s nothing to gain by using ADDWU instead of PACK[D] for ZEXT.W. Using PACK for all ZEXT where it’s possible is more consistent.



If you don't have PACK but do have ADDWU,  you still have ZEXT.W. PACK is in Zbb though, so it is probably moot. 
 

Any mention of RV128 in the bitmanip spec is just commentary considering that there is not even a ratified RV128 base ISA spec.


When I mention RV128 it is only to show that we have considered implications for RV128. (For example wrt encoding of shift-like instructions.)


Page 7 section 2.1.5 Min/Max
Proposal:
I would propose to also add SGE and SGEU. As is done in RI5CY/PULP, this would be nicely placed with the SLT and SLTU instructions (i.e. have the same func3 bits as SLT and SLTU see proposal in  section 2.1.3 ) making each a 2 bit family 

That would replace a sequence of only two easily fusable instructions. (SLT/SLTU + XORI)



True, it just fits nicely as a 2 bit family with SLT, MAX and MIN, which is, I guess, why the RI5CY people have it.
 

A while back I disassembled the entire debian base image to figure out what percentage of instructions would be sequences matching ==, !=, ||, or &&, to estimate if it would be worth adding instructions for those. This was the result:

 

total 10489586

seteq 634 (0.0060%)

setne 1856 (0.0177%)

lor 353 (0.0034%)

lnor 243 (0.0023%)



Cool! You must have done more such experiments, and perhaps it is worthwhile posting them.

 

I have not done a similar experiment for SGE or SGEU.

 

You are welcome to perform a similar experiment for SGE/SGEU and propose the instructions via github issue if it turns out they’d see a lot of usage. Here are my scripts for reference:


http://svn.clifford.at/handicraft/2019/rvlogicops/


OK 

Page 8 
Taking the absolute value of an integer is a generally useful operation and I think it is not needed  to mention SAT solvers.

I’ve made the language a bit more generic. But note that even though it’s a generally useful operation, there are few applications that need it so frequently that it has measurable performance impacts. (That is of course in big part because calculating abs() isn’t very expensive even without min/max.) Therefore I think it does make sense to mention SAT solvers as a concrete application.



Maybe it is just me, but SAT solvers look like a bit of a niche application to me. 

 
Page 8. Section 2.1.6 sbset/sbclr/sbinv/sbext
Please add:
These instructions efficiently support bitfields, and avoid costly constant constructions  and nasty surprises especially if the bitfields are not in low bits.

In the context of bit manipulation instructions, the term “bitfield” is commonly used in the context of instructions that manipulate contiguous regions of bits. (See for example the ARM instructions BFC, BFI, SBFX, etc.)


These are single bit instructions, not bitfield instructions.



Good point. I meant bitfield in the sense of the C language, i.e. an example like

struct nasty{
        unsigned long long counter:62;
        unsigned flag1: 1;                // ouch, have to manipulate bit 62 and 63, good luck if your code depends on nasty being read in one go.
        unsigned flag2: 1;
};


 

I think the s prefix should be reserved for instructions returning booleans (0/1), which means only SBEXT  is properly named. I would suggest BSET1, BCLR1, BINV1, and SBEXT1 or SBGET1 (but see below)


You don’t even say why you feel that the “S” prefix should be reserved for such instructions..



s as in set, as in set less than (slt) and set less than unsigned (sltu), which are (currently) the boolean operators.
 

Also, too late: There’s already SB SH SW SLL[I] SRL[I] SRA[I] and the SEXT.W pseudo-instruction in the base ISA.



Yes :-(

 
What is meant with “bitwise prefix sum of the bits in the lower byte of a 32 bit word on RV32” here?

https://en.wikipedia.org/wiki/Prefix_sum

https://en.wikipedia.org/wiki/Bitwise_operation


The example code in question is described as following in the spec:


“For example, the following code calculates the bitwise prefix sum of the bits

in the lower byte of a 32 bit word on RV32. The final prefix sum is stored in the 8 nibbles of the output word.”


I.e. when A[i] is bit i of the input, and B[i] is nibble i of the output, then


B[0] = A[0]

B[1] = A[0] + A[1]

B[2] = A[0] + A[1] + A[2]

etc.



If that is now written it is perfectly clear.
 
Since you already say that they often are not faster than software emulation, and one of the first hits you get for PEXT is that AMD implements them horribly slow, maybe they should not be in B but still be defined as Z extension.

Did you even read the introduction section of the spec chapter?



Yes, and it has/had the BEXT /BDEP in the grey-shaded "suggested to be in B" block. But I guess one could implement Zbb instead.  

 
The CLMUL  and CLMULH/ CLMULR instructions implement polynomial multiplication over the Galois field GF(2) and the closely related multiplication in the higher Galois fields GF(2^n). They are used extensively in cryptography and can be used to compute CRC’s which are defined by computations in Galois fields. 

They don’t implement multiplication over GF(2^n). That would be carry-less product followed by reduction. So implementing GF(2^n) products is an _application_ of clmul, but it is not what clmul does. Using Barrett reduction a GF(2^n) product can be implemented with 4 clmul operations: Two clmul for the polynomial GF(2) multiplication and two more for the reduction.



That is a better formulation.  I am just too used to thinking about equivalence classes mod an ideal. 
 

I’ve added the sentence “Carry-less multiplication is equivalent to multiplication in the polynomial ring over GF(2)”.



Great! I will get a hanky and will just sniff a bit over calling them carry-less multiplication :).

 
Page 24
S/GCM/GCD/GCM (greatest common divisor/multiple) of binary polynomials/

GCM means Galois/Counter Mode here.


I’ve replaced “CRC and GCM” with “Cyclic Redundancy Check (CRC) and Galois/Counter Mode (GCM)”.


OK!
 
What is the prefix XOR operation? 

for (i=0..XLEN-1): out[i] = ^in[i:0]


where X[i] is bit number i, X[i:0] is the i+1 LSB bits and ^X is XOR reduce.


Like Prefix-sum but with xor instead of addition.



OK should now be clear. 
 
Isn’t a CLMUL instruction easier and cheaper than a MULT instruction, so may be implemented even if M is not implemented? ( I may be wrong here and obviously a MULT instruction has more general benefit).

It is cheaper but not so cheap that it’s likely that an implementation would bother with fast CLMUL if they don’t even bother with fast MUL.



Fair enough. 
 
I like the binary matrix operations. However, maybe it is enough to just define the more primitive multiplication between a matrix and a vector. We can then also have different (power of two by power of two) sized matrices multiplied with vectors.

The HW cost for bmat[x]or is not as big as one might expect.


My reference single-cycle implementation of bmat[x]or mapped to a generic 4-LUT FPGA architecture is 706 LUTs in size. That’s less than the size of five 64-bit adders.


Being able to for example perform an arbitrary permutation of bytes in a word in a single instruction is interesting. Doing it in 8 instructions doesn’t really add much.



OK. 
 
Ternary operators have pushback because they are a burden on the micro architecture

That’s why they are in a separate Z extension. 


Does changing the order of the registers in the assembler compared to the encoding not create more confusion than it solves (I genuinely don’t know).

Feel free to discuss at https://github.com/riscv/riscv-bitmanip/issues/11 

Apparently it would be “insane” not to do it. Let me know when you two have agreed on whether changing the order is “confusing” or not changing it would be “insane”..



Ha Ha OK :-)

 
shows that the immediate versions (where the compiler can do the test) are just three instructions

and we had a very long discussion within the task group and ultimately decided that it would be worthwhile to include them anyway. I agree with that decision.



OK
 
What does ADDIWU do? Does it zero or sign extend its 12-bit immediate? (I would suggest zero extending is more useful as it increases the range of constants that can be defined in one instruction.)

The immediate is sign-extended, exactly like the immediate to ADDIW. I’ve now added a sentence clarifying that.



OK.
 
The i^j example is not terribly good as i^j happens to be  already zero extended, if i and j are properly sign extended.

No. i^j is sign extended if i and j are sign extended. (That’s why there is no XORW instruction.)


For example: 0x80000000 ^ 0x00000000


With sign extended inputs:

0xFFFFFFFF80000000 ^ 0x0000000000000000 = 0xFFFFFFFF80000000



I am an idiot :-(. 

Still think the expression p[i] is a better example than p[i^j].
  

The SLLIU.W instruction is used to compute addresses of array elements indexed by unsigned integers.

All five “unsigned address calculation instructions” are.



Good point. 
 
Comment:
Is there no SLLU.W (aka SLL.WU) instruction? 

What would we need that for?


Regularity of the ISA. I mainly asked because it surprised me and it looked like a simple oversight.
 

It might be slightly better to have rs1 zero extended instead of rs2. That leaves open the possibility to have a  slightly more regular C instruction 

That would zero extend the wrong input for subu.w if the idea is to use it for the address calculation use-case.



makes sense.
  
I think the scheme could and should be extended to bytes and halves. [..]
Together that gives the following add  instructions : 
#add #addw #addd
addsl1 addwsl1 adddsl1
addsl2 addwsl2 adddsl2
addsl3 addwsl3 adddsl3
addsl4 addwsl4 adddsl4
addsl5 addwsl5 adddsl5
addsl6 addwsl6 adddsl6
addsl7 addwsl7 adddsl7
add.b      addw.b addd.b            
add.h addw.h addd.h
add.w addd.w
add.d
add.bu addwu.bu adddu.bu
add.hu addwu.hu adddu.hu
add.wu addwu adddu.wu
add.du adddu
With the pattern repeating for sub.

If I counted correctly that would mean adding 82 new R-type instructions. That’s over half of a minor opcode. You can’t be serious.



I am serious. That there are so many ADDs comes from making it two families, and from the internal logic of the encoding: if you have an instruction and the instruction doesn't preserve sign extension, you must also have the corresponding W and D versions, and it is cheap to do so because the proper sign extension hardware is there anyway. But note that every single one of them corresponds directly to (one of the admittedly many) C language + operations, and adds are bread and butter, so it is OK if it takes up room. I think avoiding nasty surprises and being able to support C operations essentially 1 to 1 in the ISA is an important goal. One could reduce the size by half by ditching the sl and perhaps the b/bu versions.

 
Page 32, Section 2.10 Future compressed instructions
I would take this section out, or just suggest that C.NOT, C.NEG and C.ZEXT.W and C.ZEXT.D may be worthwhile additions.  More quantitative data is needed however, and this just cannot be done just for the B extension. For example I think another instruction that may be worthwhile is

Did you read the part where it says this is a recommendation for a future revision of “C”, not part of the “B” extension?

 


I did, and wanted to suggest taking the whole section out, and got carried away clarifying and toning down :-(.
 
Page 41: 
S/32-bit numbers that have their MSB set/zero extended 32-bit numbers/ 
(I.e. they don’t have their MSB set but _unset_!)

Here is a 32-bit number with its MSB set: 0x82341234


Sign extended: 0xFFFFFFFF82341234


Zero extended: 0x0000000082341234


You can’t load the zero extended version on RV64 with LUI+ADDI.

But you can load it with LUI+ADDIWU.



That is exactly what I had in mind but was not what I read from your text. 
 
Page 42 Section 2.12.4
I think the extraction sequence  should be 
slli rd rs1 (-(pos + len)) % XLEN
srli rd rd  (-len) % XLEN

I think taking the remainder of a negative number is a bad idea in many contexts, but especially in something like a specification, considering that in languages like C the behavior is implementation-defined.



Should have been more clear that we are dealing with _unsigned_ integers (like uint_xlen_t) and the % operator is perfectly well defined for "negative" integers. Systematically writing % XLEN just stresses the fact that shifts in RV have a shift amount that is naturally thought of as (the equivalence class of) a number mod LEN, with modular arithmetic mod LEN (I always thought this was clever and clean in RV).
More importantly, an expression like rs1 << (XLEN - rs2) is undefined in C for rs2 == 0, while in RV it does rs1 << 0, i.e. it uses rs1 << ((-rs2) % XLEN).
 
this always makes sense, and does “the right thing” for 0 <= pos + len <= XLEN, pos >= 0, and len >= 0 except pos == len == 0 :-(.

I’m not even sure what you imply “the right thing” is for len=0, and pos!=0.


It should return 0 as you extract an empty string. 
 



I think the insertion sequence of a bit field of length len starting at bit 0 is wrong and should be 
slli rd rs1 (-len) % XLEN
srli rd rd  (-(len + pos)) % XLEN

I fixed the issue, but kept my form that does not require modulo of negative numbers.


See above; since they are unsigned integers there are no negatives, just inverses mod 2^{XLEN}.

This also does “the right thing” except for pos == len == 0

Again, not seeing how leaving the register unchanged (left shift by zero, right shift by zero) is “the right thing”.




The elephant in the room is to what extent to standardise if compilers should recognise recommended fused sequences, what sequences to recommend,  and to what extent compiler writers are supposed to respect these sequences. 

There is no requirement for a compiler or for a processor to implement any of those.


The BFMAK  macro definitions seems wrongly defined and I think it should be simply 
bfmak rd, len, pos —> sloi rd, zero, len,  c.slli rd rd pos 

That’s just an equivalent definition. The difference is that your definition does not match the *-srli/*-srai fusion pattern, but the definition in the spec does.


And the description for that list of three pseudo-ops literally says “bitfield pseudo-ops for sr[la]i postfix fusion”, so I think the sequences should actually match that fusion pattern.


Please use ((-len-pos)%XLEN) instead of (XLEN - len -pos) and similarly for other shift amounts.  

I’m not going to use C-like expressions that rely on implementation defined C behavior in a spec document.


Page 46: section 3.1.3
Select could be simplified with sbset

That’s correct. That example was written before we added SB* instructions. I’ve changed it now accordingly.


Page 46:
Can one use GREV instead of SHFL here?

I don’t think so. I wouldn’t know how.


One should really not have to use SHFL to pack bytes in a word. With PACK.BU and PACK.HU it becomes the straightforward

You are saying that as if it would be obvious what’s bad about using shuffle for that. It’s not obvious to me. Why do you think we need to invent a new instructions to do something that can already be done with SHFL?



Because I feel uncomfortable with SHFL :-) But seriously, SHFL is not in Zbb and it seems specialised to me. On the other hand, packing two bytes in a halfword or two halfwords in a word seems like the kind of thing that is needed for something bulk and common like packing network packets or RGBA data, and perfectly reasonable to expect from Zbb.
  
Page 48: 3.1.6 Round to next power of two
The number 0 is not a power of two, but this may be conventional use anyway.

That’s why the text contains the clause “i.e. equivialent to rfill(x-1)+1.”


I was just pointing out that "round to next power of two" is, strictly speaking, mathematically incorrect and so the conventional  behaviour 0 -> 0 is not mathematically correct behaviour.
 


Page 52
The butterfly function is undefined and seems to depend on parameters other than just the stage number.

The first paragraph in that section defines what a stage-N butterfly operation is.


 Do you mean the two instruction unnamed (apparently #define type ) macro at the beginning of section 4.3.1 in the v0.91 version ? 
Then I think it would be clearer to name it there and make the a0, a1, a2 explicit in the parameters, i.e.

#define butterfly(N, a0, a1, a2)
  grevi a2, a0, (1 << N)
  cmix a0, a1, a2, a0

with the corresponding change in the sections below.


Page 60
I think you mean remainder instead of the (German) rest.

Oops. :D Fixed.


:-)
 

Page 61
These decode examples could probably use a C code version as a companion.  How did you construct them ? 

I created them with the use of an SMT solver.


OK I see. Perhaps worth adding, if only for the curious. 
 


Regards,

- Clifford



Regards 

Rogier 

lkcl

Sep 4, 2019, 5:04:40 PM
to RISC-V ISA Dev, rogier....@gmail.com
Hi, ok it is fantastic to see this review taking place. It is so long as an inline conversation that I am going to break the inline reply convention and summarise the context.

Regarding the suggestion of extending the adduw/slluw etc to a whopping 82+ operations covering all permutations of 8/16/32 on all src and dest operands:

1. It completely breaks the RISC paradigm

2. *this is exactly what SimpleV's element width override is designed to provide*.

3. What you are suggesting, Rogier, despite there being good reasons, is exactly the sort of nightmare opcode proliferation behind SIMD. See "SIMD Considered harmful". SIMD is, at its heart, an O(n^6) opcode proliferation nightmare. Yes, really, 6 dimensions of options. You just hit 4 of them.

SimpleV provides a means to individually override the bitwidth of all and any of the operands, such that an instruction's behaviour covers 4x4x4-1 permutations that it otherwise could not possibly have.

It takes some setup, and was originally designed for vectorisation scenarios, where manipulating multiple packed byte (or 16 bit) elements would otherwise be prohibitively slow.

So in a single op case there is some overhead (32 bits). If however there are contiguous registers which require the same operation then that 32 bit overhead is a one-off cost for all the contiguous operations, and, furthermore, only a single instruction is needed to cover them all.

The actual SV element width override behaviour, when a vector is set (and VL=1), is that the top bits of the destination register must be left UNMODIFIED. This can save on about 2 to 3 instructions in packed structs which would otherwise require ORs and ANDs etc.

If however scalar mode is set and elwidth override set to 32 bit on the destination, ADD will do exactly what addu does, and ADD.W will do exactly what adduw does.

SV provides this functionality *WITH ZERO OPCODES ADDED*.

to reiterate: SV provides the functionality of adduw, subuw and slluw, and the full set of 82+ variants recommended by Rogier, *even though there is not one single new instruction added by SV*.

Please consider carefully whether the opcode space should be wasted, reading, absorbing and applying the lesson of "SIMD considered harmful".

Including whether adduw etc should be added at all (given particularly that SV provides them "for free")

I reiterate what I said earlier: BitManip has some very valuable and necessary opcodes, however compared to the original 0.35 and 0.36 xBitManip of 18 months ago, where Clifford's initial analysis was thorough, clear and extremely well justified, the "specialised" opcode proliferation present in the 0.90 spec which has little to no general purpose use-cases is extremely alarming and should be ringing massive alarm bells across the entire RISCV community.

The principles of RISC are that you pick the best suite of opcodes that are both easy to do in hardware yet also, for the majority of cases, give damn good power efficiency and compact code right across the computing industry.

That said: if it takes 82 additional opcodes to add something that is actually really useful and can be shown to significantly reduce code size in a large number of common cases, then there is something really badly wrong with the RISCV ISA (and with the whole RISC principle).

Rogier, can it be shown how much code size would be reduced by those 82 opcodes, and how common such opportunities are?

Is anyone able to assist with such an evaluation?

L.

Clifford Wolf

Sep 5, 2019, 6:12:36 AM
to Rogier Brussee, RISC-V ISA Dev
Hi,

On Wed, Sep 4, 2019 at 9:37 PM Rogier Brussee <rogier....@gmail.com> wrote:
For many sections it is not quite clear what the W versions (for RV64) or D versions (for RV128) are supposed to do.  I will try to indicate how I understand it works.

The general semantic and reasoning behind *W instructions is explained in the base ISA spec. They always just perform the RV32 operation on RV64. I don’t see it in the scope of the Bitmanip spec to explain *W instructions as a concept.


It is not the concept of W,  which is perfectly clear, it is that you mention XLEN specifically in the C code (which should change to 32 for W) and uint_xlen_t (which should just stay what it is) and standards need to be excruciatingly explicit.

The bitmanip document currently states "Like in the base ISA, we add *W instruction variants on RV64 with the semantic of the matching RV32 instruction." That's sufficient imo.

Right.  It zero extends because that is needed for the use case of building  

struct foo{
       short low, high;
}

I don't see anything in the calling convention that would require the upper bits of the register holding an aggregate of that type to have any specified value. So simply using PACKW to pack this struct on RV64 would be perfectly fine afaict.

Taking the absolute value of an integer is a generally useful operation and I think it is not needed  to mention SAT solvers.

I’ve made the language a bit more generic. But note that even though it’s a generally useful operation, there are few applications that need it so frequently that it has measurable performance impacts. (That is of course in big part because calculating abs() isn’t very expensive even without min/max.) Therefore I think it does make sense to mention SAT solvers as a concrete application.


Maybe it is just me, but SAT solvers look like a bit of a niche application to me. 

Well, I did make the language more generic, mentioning SAT solver as one application.

But worry not, I can assure you that min/max does not only help for SAT solvers. There have been some benchmarks showing significant speedup in specint just from MIN/MAX. But I can't reference that because it hasn't been properly published (yet?).
 
Page 8. Section 2.1.6 sbset/sbclr/sbinv/sbext
Please add:
These instructions efficiently support bitfields, and avoid costly constant constructions  and nasty surprises especially if the bitfields are not in low bits.

In the context of bit manipulation instructions, the term “bitfield” is commonly used in the context of instructions that manipulate contiguous regions of bits. (See for example the ARM instructions BFC, BFI, SBFX, etc.)


These are single bit instructions, not bitfield instructions.



Good point. I meant bitfield in the sense of the C language, i.e. an example like

struct nasty{
        unsigned long long counter:62;
        unsigned flag1: 1;                // ouch, have to manipulate bit 62 and 63, good luck if your code depends on nasty being read in one go.
        unsigned flag2: 1;
};

Well, that's the special case of a bit field of size "1". The "counter" member is a bitfield of size 62.

(Jfyi, you want to use the "packed" attribute when doing something like that.)

Since you already say that they often are not faster than software emulation, and one of the first hits you get for PEXT is that AMD implements them horribly slow, maybe they should not be in B but still be defined as Z extension.

Did you even read the introduction section of the spec chapter?


Yes, and it has/had the BEXT /BDEP in the grey-shaded "suggested to be in B" block. But I guess one could implement Zbb instead.  

I actually meant literally _read_. The text.

If I counted correctly that would mean adding 82 new R-type instructions. That’s over half of a minor opcode. You can’t be serious.


I am serious. That there are so many ADDs comes from making it two families, and from the internal logic of the encoding: if you have an instruction and the instruction doesn't preserve sign extension, you must also have the corresponding W and D versions, and it is cheap to do so because the proper sign extension hardware is there anyway. But note that every single one of them corresponds directly to (one of the admittedly many) C language + operations, and adds are bread and butter, so it is OK if it takes up room. I think avoiding nasty surprises and being able to support C operations essentially 1 to 1 in the ISA is an important goal. One could reduce the size by half by ditching the sl and perhaps the b/bu versions.

I think macro-op fusion is the answer here, not defining 82 almost identical instructions.

The nice thing about macro-op fusion is that it is completely optional. A small core might not bother, and a large one might do it, but both support the same ISA.

By defining a flood of small instructions we'd be forcing unnecessary complexity on small cores.

Page 42 Section 2.12.4
I think the extraction sequence  should be 
slli rd rs1 (-(pos + len)) % XLEN
srli rd rd  (-len) % XLEN

I think taking the remainder of a negative number is a bad idea in many contexts, but especially in something like a specification, considering that in languages like C the behavior is implementation-defined.


Should have been more clear that we are dealing with _unsigned_ integers (like uint_xlen_t) and the % operator is perfectly well defined for "negative" integers. Systematically writing % XLEN just stresses the fact that shifts in RV have a shift amount that is naturally thought of as (the equivalence class of) a number mod LEN, with modular arithmetic mod LEN (I always thought this was clever and clean in RV).
More importantly, an expression like rs1 << (XLEN - rs2) is undefined in C for rs2 == 0, while in RV it does rs1 << 0, i.e. it uses rs1 << ((-rs2) % XLEN).

We are dealing with pseudo code here. There's no formal definition in that part of the spec of what C type "pos" and "len" are. But from the context I think the assumption that these are (gnu) assembler expressions is justified. And afaict there's no formal definition on this level of detail of how those work (https://sourceware.org/binutils/docs/as/Integer-Exprs.html) wrt things like signedness. So why not just avoid the whole issue and not use modulo in this way.
 
>  this always makes sense, and does “the right thing” for 0 <= pos + len <= XLEN, pos >= 0, and len >= 0 except pos == len == 0 :-(.

I’m not even sure what you imply "the right thing" is for len=0, and pos!=0.


It should return 0 as you extract an empty string.

Then your code does not do "the right thing" because your code will return the input unmodified (left shift by zero bits followed by right shift by zero bits).
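To make the corner case concrete, a minimal sketch of the two-shift extraction (assuming pos + len <= XLEN; written with non-negative shift amounts):

uint_xlen_t bfextu_sketch(uint_xlen_t x, unsigned pos, unsigned len)
{
	if (len == 0)
		return 0;              /* needs a special case: a shift by XLEN is not a valid way to clear the register */
	x <<= XLEN - pos - len;    /* drop the bits above the field */
	x >>= XLEN - len;          /* drop the bits below, zero-filling */
	return x;
}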

One should really not have to use SHFL to pack bytes in a word. With PACK.BU and PACK.HU it becomes the straightforward

You are saying that as if it would be obvious what’s bad about using shuffle for that. It’s not obvious to me. Why do you think we need to invent a new instructions to do something that can already be done with SHFL?


Because I feel uncomfortable with SHFL :-) But seriously, SHFL is not in Zbb and it seems specialised to me. On the other hand, packing two bytes in a halfword or two halfwords in a word seems like the kind of thing that is needed for something bulk and common like packing network packets or RGBA data, and perfectly reasonable to expect from Zbb.

Tbh I find it hard to identify a common theme behind your arguments. On the one hand you say ROL/ROR is not really needed because it can be emulated with three instructions for the case of immediate shift amount, and four instructions for the case of variable shift amount, and on the other hand you want a special instruction for RD = (RS1&255) | ((RS2&255)<<8) in Zbb. (And in many practical cases the "&255" is not actually needed.)
 
 Page52
The butterfly function is undefined and seems to depend on parameters other than just the stage number.

The first paragraph in that section defines what a stage-N butterfly operation is.


Do you mean the two instruction unnamed (apparently #define type ) macro at the beginning of section 4.3.1 in the v0.91 version ? 
Then I think it would be clearer to name it there and make the a0, a1, a2 explicit in the parameters, i.e.

#define butterfly(N, a0, a1, a2)
  grevi a2, a0, (1 << N)
  cmix a0, a1, a2, a0
with the corresponding change in the sections below.

The text above the two instructions literally reads "The following macro performs a stage-N butterfly operation on the word in a0 using the mask in a1."

So I don't know where you get the idea that it is unnamed, or that it's unclear that a0 and a1 are arguments.

regards,
 - Clifford

hao sun

Sep 5, 2019, 8:38:11 PM
to RISC-V ISA Dev, rogier....@gmail.com
Hi Clifford,

I am reading the B extension spec, which is really impressive since it has use cases for each instruction.
Thanks so much for your efforts.

Thanks
Hao Sun

On Wednesday, September 4, 2019 at 8:36:22 PM UTC+8, clifford wrote:

Clifford Wolf

Sep 6, 2019, 2:12:59 AM
to Rogier Brussee, RISC-V ISA Dev
Hi,

reflecting on Rogier's review I am now proposing two new instructions, REPACK[I] and PACKB:

please feel free to comment on github.

regards,
 - Clifford

Rogier Brussee

Sep 6, 2019, 7:48:42 AM
to RISC-V ISA Dev, rogier....@gmail.com


On Thursday, September 5, 2019 at 12:12:36 UTC+2, clifford wrote:
Hi,

On Wed, Sep 4, 2019 at 9:37 PM Rogier Brussee <rogier...@gmail.com> wrote:
For many sections it is not quite clear what the W versions (for RV64) or D versions (for RV128) are supposed to do.  I will try to indicate how I understand it works.

The general semantic and reasoning behind *W instructions is explained in the base ISA spec. They always just perform the RV32 operation on RV64. I don’t see it in the scope of the Bitmanip spec to explain *W instructions as a concept.


It is not the concept of W,  which is perfectly clear, it is that you mention XLEN specifically in the C code (which should change to 32 for W) and uint_xlen_t (which should just stay what it is) and standards need to be excruciatingly explicit.

The bitmanip document currently states "Like in the base ISA, we add *W instruction variants on RV64 with the semantic of the matching RV32 instruction." That's sufficient imo.

If that unambiguously defines the W (or, mutatis mutandis, D) version, yes. Just to be explicit, I would add that the result is sign extended unless mentioned otherwise.

 

Right.  It zero extends because that is needed for the use case of building  

struct foo{
       short low, high;
}

I don't see anything in the calling convention that would require the upper bits of the register holding an aggregate of that type to have any specified value. So simply using PACKW to pack this struct on RV64 would be perfectly fine afaict.

fair enough!

I just saw that you have now proposed a PACKB instruction. Great! The PACK family of instructions is now exactly the PACKB, PACKH (RV32, RV64, RV128), PACKW (RV64, RV128) and PACKD (RV128) instructions, i.e. compared to my proposal, just sign rather than zero extended, and named and encoded differently. If that is more convenient/efficient/acceptable, fine.


Taking the absolute value of an integer is a generally useful operation and I think it is not needed  to mention SAT solvers.

I’ve made the language a bit more generic. But note that even though it’s a generally useful operation, there are few applications that need it so frequently that it has measurable performance impacts. (That is of course in big part because calculating abs() isn’t very expensive even without min/max.) Therefore I think it does make sense to mention SAT solvers as a concrete application.


Maybe it is just me, but SAT solvers look like a bit of a niche application to me. 

Well, I did make the language more generic, mentioning SAT solver as one application.


My point was purely psychological. Justifying something with something else which can easily be perceived as niche tends to undermine your argument.

 
But worry not, I can assure you that min/max does not only help for SAT solvers. 
There have been some benchmarks showing significant speedup in specint just from MIN/MAX. But I can't reference that because it hasn't been properly published (yet?).
 

I am not at all surprised, and the fact that saturating instructions exist and are commonly needed gives a strong indication of their usefulness. I also always thought it was a wart in the ISA that AMOs can do MAX and MIN on memory but that MAX and MIN on registers was not an instruction.

 
Page 8. Section 2.1.6 sbset/sbclr/sbinv/sbext
Please add:
These instructions efficiently support bitfields, and avoid costly constant constructions  and nasty surprises especially if the bitfields are not in low bits.

In the context of bit manipulation instructions, the term “bitfield” is commonly used in the context of instructions that manipulate contiguous regions of bits. (See for example the ARM instructions BFC, BFI, SBFX, etc.)


These are single bit instructions, not bitfield instructions.



Good point. I meant bitfield in the sense of the C language, i.e. an example like

struct nasty{
        unsigned long long counter:62;
        unsigned flag1: 1;                // ouch, have to manipulate bit 62 and 63, good luck if your code depends on nasty being read in one go.
        unsigned flag2: 1;
};

Well, that's the special case of a bit field of size "1". The "counter" member is a bitfield of size 62.

(Jfyi, you want to use the "packed" attribute when doing something like that.)

May well be.  I am not sure I even want to know the details. https://lwn.net/Articles/793253/
 

Since you already say that they often are not faster than software emulation, and one of the first hits you get for PEXT is that AMD implements them horribly slow, maybe they should not be in B but still be defined as Z extension.

Did you even read the introduction section of the spec chapter?


Yes, and it has/had the BEXT /BDEP in the grey-shaded "suggested to be in B" block. But I guess one could implement Zbb instead.  

I actually meant literally _read_. The text.


Yeah I read it before my holiday. Then I wrote that after my holiday and just checked with the diagram.


If I counted correctly that would mean adding 82 new R-type instructions. That’s over half of a minor opcode. You can’t be serious.


I am serious. That there are so many ADDs comes from making it two families, and from the internal logic of the encoding: if you have an instruction and the instruction doesn't preserve sign extension, you must also have the corresponding W and D versions, and it is cheap to do so because the proper sign extension hardware is there anyway. But note that every single one of them corresponds directly to (one of the admittedly many) C language + operations, and adds are bread and butter, so it is OK if it takes up room. I think avoiding nasty surprises and being able to support C operations essentially 1 to 1 in the ISA is an important goal. One could reduce the size by half by ditching the sl and perhaps the b/bu versions.

I think macro-op fusion is the answer here, not defining 82 almost identical instructions.

The nice think about macro-op fusion is that it is completely optional. A small core might not bother, and a large one might do it, but both support the same ISA.

By defining a flood of small instructions we'd be forcing unnecessary complexity on small cores.


Ok let me explain my reasoning. You may, of course, still hate the idea.

There are really two families of instructions: the addsl/subsl ([ADD SUB]SL<k>) family (which includes regular ADD) and the add./sub. ([ADD SUB].[BHWD][U]) family (which includes ADDWU).

First let's see where the number 82 comes from.

For reference: 

#add #addw #addd
addsl1 addwsl1 adddsl1
addsl2 addwsl2 adddsl2
addsl3 addwsl3 adddsl3
addsl4 addwsl4 adddsl4
addsl5 addwsl5 adddsl5
addsl6 addwsl6 adddsl6
addsl7 addwsl7 adddsl7
add.b      addw.b addd.b            
add.h addw.h addd.h
add.w addd.w
add.d
add.bu addwu.bu adddu.bu
add.hu addwu.hu adddu.hu
add.wu addwu adddu.wu
add.du adddu
With the pattern repeating for sub.


So we get 

            RV32    RV64    RV128
addsl          7      14       21
subsl          7      14       21
add.           4      11       20
sub.           4      11       20
----------------------------------
total         22      50       82
   
  

Hence small cores are not facing 82, but 22 instructions.

As mentioned earlier, my gut feeling is that SUBs are not terribly important and could easily be dropped, cutting the total numbers in half, but they come along cheaply in the package deal. This is because the instructions are all very closely related and the whole slew reuses the same HW, and a lot already exists:

* The difference between add and sub is the use of the existing inverter on rs2, which is controlled by the existing 1 bit.
* The difference between the different shifts can reuse the three lowest existing shifter stages and is controlled by three bits.
* The difference between add. and sub. is the existing inverter on rs2, controlled by the existing 1 bit.
* The difference between the sl and . versions is controlled by a bit.
* The difference between the signed and unsigned versions is controlled by 1 bit.
* The difference between .B/.H/.W/.D is the existing (for load) zero or sign extension on rs2 from bit 7, 15, 31 or 63, controlled by 2 bits (one of which is always 0 on RV32).
* The difference between the XLEN and W (or D) versions is the existing sign extension on rd, controlled by the existing difference in major opcode.


==Can this be done with fusion?==

Short answer: yes.
Long answer:

The easy case is addsl.

addsl<k> rd rs1 rs2 

is fusion of 

slli rd rs2 k
c.add rd rd rs1

The ADD.*U instructions are also easy. E.g.

add.hu rd rs1 rs2

is fusion of

packh  rd rs2 zero     #assume  packh is a macro with the proper pack for the architecture 
c.add   rd rd rs1

The ADD.* instructions are slightly more difficult but still feasible

add.h rd rs1 rs2

is fusion of

slli rd rs2 -16
srai rd rd -16
c.add rd rd rs1

===Why would you not do fusion, then?===

For ADDSL

* An SLLI; ADD fusion is number one on everybody's list of fusion targets because address computation is so common that I suspect very few processors will not want to implement it (and they still can!).
* If they are "expected to" provide the functionality anyway, having a regular instruction addsl<k> should be _easier_ on low end processors than having to implement fusion decoding.
* The fusion pattern consists of two RVC instructions only in the case that rd == rs2 (the index).

For unsigned ADD.*U

* Note that ADD.WU is just a different name for ADDU.W, and ADDWU (aka ADDWU.WU) is the same as the ADDWU in the proposal.
* A ZEXTH; ADD fusion would be high on people's list of fusion targets for much the same reason that ADDU.W and ADDWU are in the B proposal: you get lots of ZEXTH/PACKH instructions if people use unsigned short instead of int (bad practice, but there is plenty of code that does it), and it would implement addition of unsigned shorts, which by the calling convention are zero extended and whose result is an unsigned int (in the rare cases where the compiler cannot prove that overflow does not matter when adding unsigned shorts, it would also have to insert a SEXT somewhere, because unsigned ints are kept sign extended). Admittedly a compiler could probably remove a lot of the PACKs, but they could still lead to nasty surprises.
* Again, a regular instruction should be easier on the low end.
* The fusion pair is always 2 bytes longer.

For signed ADD.*

* In theory, ints and signed shorts are represented properly sign extended. Compilers may still insert sign extensions, and in any case results must be properly sign extended even when using longer registers. For operations on words among themselves the W instructions take care of this, and in fact there is a complete doubling of the arithmetic instructions instead of a single sign extension instruction. Just having sign extension should be enough to avoid the worst surprises with signed shorts and signed chars. However,
* there is no single-instruction (macro) SEXT.B or SEXT.H similar to the ZEXT.B and ZEXT.H provided by PACK and ADDI. ADD.B and ADD.H do provide sign extension, and by being combined with an add they cover by far the most important case.
* The fusion triple is always 16 bits longer and can only use RVC instructions if rd == rs2 and rd is a popular register.



===Encoding===
Since everything transfers so regularly to the W and D versions, I only give the regular version. The bits are supposed to be found somewhere in the funct7 of the ADD instruction, with ADD and SUB keeping their already defined encoding, i.e. reusing the existing bit that selects the inverter.

#add    00000
addsl1  00001
addsl2  00010
addsl3  00011
addsl4  00100
addsl5  00101
addsl6  00110
addsl7  00111
add.bu  01000
add.hu  01001
add.wu  01010 (RV64)
add.du  01011 (RV128)
add.b   01100
add.h   01101
add.w   01110 (RV64)
add.d   01111 (RV128)
#sub    10000
subsl1  10001
subsl2  10010
subsl3  10011
subsl4  10100
subsl5  10101
subsl6  10110
subsl7  10111
sub.bu  11000
sub.hu  11001
sub.wu  11010 (RV64)
sub.du  11011 (RV128)
sub.b   11100
sub.h   11101
sub.w   11110 (RV64)
sub.d   11111 (RV128)



Page 42 Section 2.12.4
I think the extraction sequence  should be 
slli rd rs1 (-(pos + len)) % XLEN
srli rd rd  (-len) % XLEN

I think taking the remainder of a negative number is a bad idea in many contexts, but especially in something like a specification, considering that in languages like C the behavior is implementation-defined.


I should have been clearer that we are dealing with _unsigned_ integers (like uint_xlen_t), for which the % operator is perfectly well defined even for "negative" values. Systematically writing % XLEN just stresses the fact that a shift amount in RV is naturally thought of as (the equivalence class of) a number mod LEN, with modular arithmetic mod LEN (I always thought this was clever and clean in RV).
More importantly, an expression like rs1 << (XLEN - rs2) is undefined in C for rs2 == 0, while in RV it does rs1 << 0, i.e. RV effectively uses rs1 << ((-rs2) % XLEN).
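For concreteness, here is the proposed sequence written out as plain unsigned C, with every shift amount reduced mod XLEN exactly as the hardware does (a sketch, RV64 for definiteness; the function name extract is just for illustration):

#include <stdint.h>
#define XLEN 64
typedef uint64_t uint_xlen_t;

/* Extract len bits starting at bit pos:
   slli rd rs1 (-(pos+len)) % XLEN ; srli rd rd (-len) % XLEN */
static inline uint_xlen_t extract(uint_xlen_t x, unsigned pos, unsigned len)
{
    x <<= (XLEN - (pos + len)) % XLEN;
    x >>= (XLEN - len) % XLEN;
    /* note: for len == 0 both shift amounts reduce to values that do not
       clear the register, which is the wart discussed below */
    return x;
}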

We are dealing with pseudo code here. There's no formal definition in that part of the spec of what C type "pos" and "len" are. But from the context I think the assumption that these are (gnu) assembler expressions is justified. And afaict there's no formal definition at this level of detail of how those work (https://sourceware.org/binutils/docs/as/Integer-Exprs.html) wrt things like signedness. So why not just avoid the whole issue and not use modulo in this way.
 
>  this always makes sense, and does “the right thing” for 0 <= pos + len <= XLEN, pos >= 0, and len >= 0 except pos == len == 0 :-(.

I’m not even sure what you imply “the right thing” is for len=0, and pos!=0


It should return 0 as you extract an empty string.

Then your code does not do "the right thing" because your code will return the input unmodified (left shift by zero bits followed by right shift by zero bits).


mmh, if pos = 3 and len = 0 we get

SLLI rd rs1 -3
SRLI rd rd 0

which shifts the 3 least significant bits to the top and leaves them there. Definitely not "the right thing" :-(.
 
One should really not have to use SHFL to pack bytes in a word. With PACK.BU and PACK.HU it becomes the straightforward

You are saying that as if it would be obvious what’s bad about using shuffle for that. It’s not obvious to me. Why do you think we need to invent a new instructions to do something that can already be done with SHFL?


Because I feel uncomfortable with SHFL :-) But seriously, SHFL is not in Zbb and it seems specialised to me. On the other hand, packing two bytes into a halfword or two halfwords into a word seems like the kind of thing that is needed for something bulk and common like packing network packets or RGBA data, and perfectly reasonable to expect from Zbb.

Tbh I find it hard to identify a common theme behind your arguments. On the one hand you say ROL/ROR is not really needed because it can be emulated with three instructions for the case of an immediate shift amount, and four instructions for the case of a variable shift amount, and on the other hand you want a special instruction for RD = (RS1&255) | ((RS2&255)<<8) in Zbb. (And in many practical cases the "&255" is not actually needed.)
 
Page 52
The butterfly function is undefined and seems to depend on parameters other than just the stage number.

The first paragraph in that section defines what a stage-N butterfly operation is.


Do you mean the unnamed two-instruction (apparently #define-style) macro at the beginning of section 4.3.1 in the v0.91 version?
Then I think it would be clearer to name it there and make the a0, a1, a2 explicit as parameters, i.e.

#define butterfly(N, a0, a1, a2)
  grevi a2, a0, (1 << N)
  cmix a0, a1, a2, a0
with the corresponding change in the sections below.

The text above the two instructions literally reads "The following macro performs a stage-N butterfly operation on the word in a0 using the mask in a1."

So I don't know where you get the idea that it is unnamed, or that it's unclear that a0 and a1 are arguments.


It is not _explicitly_ named as a macro; it is only described as an (abstract) operation, and it is not clear whether a0, a1, a2 are placeholder registers (as you sometimes use them) or literally these registers, in particular whether these registers are implicitly used in the next step (which, to make matters more confusing, is the same operation).
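For what it is worth, here is a minimal C model of what I understand the stage-N butterfly to compute (my reading of the macro and of the cmix operand order; stage masks written out for RV64 and helper names invented for illustration, so treat this as a sketch, not the spec's reference code):

#include <stdint.h>

/* Lower-half masks for grevi with control (1 << N), N = 0..5, on RV64. */
static const uint64_t bfly_mask[6] = {
    0x5555555555555555ull, 0x3333333333333333ull, 0x0f0f0f0f0f0f0f0full,
    0x00ff00ff00ff00ffull, 0x0000ffff0000ffffull, 0x00000000ffffffffull
};

/* grevi x, (1 << n): swap adjacent 2^n-bit blocks. */
static inline uint64_t grevi_pow2(uint64_t x, unsigned n)
{
    unsigned k = 1u << n;
    return ((x & bfly_mask[n]) << k) | ((x >> k) & bfly_mask[n]);
}

/* Stage-n butterfly on the word a0 under the mask a1:
     grevi a2, a0, (1 << N)
     cmix  a0, a1, a2, a0
   i.e. take the swapped bit where the mask is set, the original bit elsewhere. */
static inline uint64_t butterfly(unsigned n, uint64_t a0, uint64_t a1)
{
    uint64_t a2 = grevi_pow2(a0, n);
    return (a1 & a2) | (~a1 & a0);
}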
 
regards,
 - Clifford

Regards

Rogier
 

Rogier Brussee

Sep 6, 2019, 8:08:53 AM
to RISC-V ISA Dev, rogier....@gmail.com


Op woensdag 4 september 2019 23:04:40 UTC+2 schreef lkcl:
Hi, ok it is fantastic to see this review taking place

The bad thing is nobody with more expertise did.
 
, it is so long as an inline conversation that I am going to break the inline reply convention and summarise the context.

Regarding the suggestion of extending the adduw/slluw etc to a whopping 82+ operations covering all permutations of 8/16/32 on all src and dest operands:


It is 22 on RV32, 11 if you drop support for SUB (which is not expensive), and 4 if you also drop the shifted versions and prefer fusion. 
 

1. It completely breaks the RISC paradigm


see my reply to Clifford. It is meant to be a pragmatic solution to the fact that while B and H are second class citizens, some people use (unsigned) shorts and signed bytes anyway, extending how ADDU.W and ADDWU are pragmatic solutions to the fact that people use unsigned ints. 

lkcl

Sep 7, 2019, 6:16:31 AM
to RISC-V ISA Dev, rogier....@gmail.com


On Friday, September 6, 2019 at 1:08:53 PM UTC+1, Rogier Brussee wrote:


Op woensdag 4 september 2019 23:04:40 UTC+2 schreef lkcl:
Hi, ok it is fantastic to see this review taking place

The bad thing is nobody with more expertise did.

which has me quite concerned.  bitmanip is a complex strategic area.  clifford's original work (0.35, 0.36), including a proper static analysis audit, was superb.  i think it's safe to say that he's getting a little overwhelmed and overloaded.

i'd like to see other people help out, here.
 
 
, it is so long as an inline conversation that I am going to break the inline reply convention and summarise the context.

Regarding the suggestion of extending the adduw/slluw etc to a whopping 82+ operations covering all permutations of 8/16/32 on all src and dest operands:


It is 22 on RV32 , 11 if you drop support for SUB (which is not expensive) and 4 if you also drop the shifted versions and prefer fusion. 

still has me concerned.  SV can achieve the packing with SIXTY PERCENT LESS instructions used compared to bitmanip (assuming 32-bit ops).  with so little space in RVC, the argument for putting SHFL and PACK into RVC would need to be *really* watertight.  i.e.: unlikely.

and remember that's with 4 8-bit values to pack.  if there were more values to pack, it's the same SV instruction count.

section 4.1.4 (vdraft, probably same as in 0.91):

    pack a0, a0, a1
    pack a1, a2, a3
    shfl a0, a0, 8
    shfl a1, a1, 8
    pack a0, a0, a1

which is 5x 32-bit words, or 10x 16-bit words (20 bytes).

in SV this is:

    VBLK.regs[0] = {fmt:INT, elwidth:8, isvec:true, regidx:a0}
    VBLK.regs[0] = {fmt:INT, elwidth:default, isvec:true, regidx:a4}
    VBLK.setvl = 4
    C.MV a0, a4


that's 4x 16-bit words (8 bytes).

16 bits for the VBLOCK header, 16 for the element-width override on a0 and a4 (in the compact "8" bit format), 16 to specify VL=4, and a Compressed MV to do the copy *for all four registers*.  if "unexpanding" the above operation it's equivalent to:

C.MV a0[7..0] = a4[7..0]
C.MV a0[15..8] = a5[7..0]
C.MV a0[23..16] = a6[7..0]
C.MV a0[31..24] = a7[7..0]
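put another way, what both sequences are computing is the following, as a single C expression (assuming only the low byte of each source register matters; the function name is just for illustration):

#include <stdint.h>

/* pack the low bytes of four registers into one 32-bit word
   (RV32 flavour; the RV64 case just packs into the low word) */
static inline uint32_t pack4_bytes(uint32_t b0, uint32_t b1, uint32_t b2, uint32_t b3)
{
    return  (b0 & 0xffu)
         | ((b1 & 0xffu) << 8)
         | ((b2 & 0xffu) << 16)
         | ((b3 & 0xffu) << 24);
}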

the reason it is equivalent is because SV requires element-width overrides to "typecast" the register file as if it was a static byte-addressable SRAM.  c-equivalent:

typedef union {
    uint8_t actual_reg_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
    uint8_t b[0]; // array of type uint8_t
    uint16_t s[0]; // array of type uint16_t
    uint32_t i[0]; // array of type uint32_t
    uint64_t l[0]; // etc.
    uint128_t d[0]; // etc.
} reg_t;

reg_t int_regfile[128];



with the src reg having a different element width, the "rules" covering truncation/zero/sign-extension apply to simply drop the top bits of each of a0-a3.


1. It completely breaks the RISC paradigm


see my reply to Clifford. It is meant to be a pragmatic solution to the fact that while B and H are second class citizens, some people use (unsigned) shorts and signed bytes anyway, extending how ADDU.W and ADDWU are pragmatic solutions to the fact that people use unsigned ints. 

mmm.... i feel that this is a slippery-slope argument.  to explain that, let's take an analogy:

"people want video processing, therefore SIMD is a pragmatic solution to the fact that people do a lot of 8/16-bit bit manipulation".

i appreciate that SV is not for everybody.  however its sequential-byte-level manipulation, due to this "typecasting" effect, in combination with the ability to use *existing* OP32 and RVC opcodes, results in an unprecedented level of instruction compacting that is only rivalled by the Mill Architecture's "tagged" polymorphic ISA.

BitManip is supposed to be about *bit* manipulation, which is why i was so excited by the original 0.35 and 0.36 drafts.  shuffle, grev, clz, pcnt and so on - these are really difficult to do efficiently with byte/word/dword level instructions and so are easily justifiable and worthwhile.

since then: single-bit-set, these are incredibly useful on embedded platforms, particularly when it comes to GPIO set/clear operations.  without these single-bit instructions, you actually have to have special memory-mapped IO registers, where the same underlying I/O latch is mapped *three times* in different locations: one for the "real" value, one for "ORing", one for "CLEARing" (&=~xxxx).

likewise: although its encoding is slightly awkward, i can see a strong case for bfp.  again, embedded systems, peripheral management in the linux kernel - being able to drop in a range of bits into a register?  that's really really powerful.


*byte* level manipulation on the other hand?  no.  if there are *byte* level instructions being added, here, to me that's indicative that the *bit* level manipulation has either not been thought through or not designed correctly in the first place.

example: pack?  mmm... no.  i bet... if bfp was extended to 64 bit, you could actually use bfp instead.  actually... hang on... the layout of bfp is as follows:

The layout of the control word in rs2 is as follows. LEN=0 encodes for LEN=16.
   
| 3                    2                   1                    |
|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
|-------------------------------|-------------------------------|
|       | LEN |       OFF       |              DATA             |
|---------------------------------------------------------------|
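reading that layout literally, a rough C model of BFP would be something like the following (my interpretation only: field positions taken from the diagram above, and i am assuming the offset wraps mod XLEN):

#include <stdint.h>

/* sketch of BFP on RV32, fields per the diagram:
   DATA = rs2[15:0], OFF = rs2[23:16], LEN = rs2[27:24], LEN == 0 means 16 */
static inline uint32_t bfp(uint32_t rs1, uint32_t rs2)
{
    uint32_t data = rs2 & 0xffff;
    unsigned off  = (rs2 >> 16) & 0xff;
    unsigned len  = (rs2 >> 24) & 0xf;
    if (len == 0)
        len = 16;
    off &= 31;                                  /* assume offset wraps mod XLEN */
    uint32_t mask = ((1u << len) - 1) << off;
    return (rs1 & ~mask) | ((data << off) & mask);
}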


my question is: OFFS+LEN is 12 bits.  that will fit into an immediate.  therefore, why is there no bfpi?

if bfpi existed, then the sequence in 4.1.4 could become errr...

BFPI a4, a0, {len=8, off=0}
BFPI a4, a1, {len=8, off=8}
BFPI a4, a2, {len=8, off=16}
BFPI a4, a3, {len=8, off=24}

in other words, PACK *is* BFPI, and a lot more powerful *than* PACK!  yes, PACK can do 32-bit where BFPI is restricted to 16-bit - however if you have a BFPI2 which can do the same job except that len is multiplied by 2, then PACK's entire justification is *really* kicked in the teeth.

yes, i realise that i am suggesting that the following:

PACK
BFP

be replaced with:

BFP
BFPI # immediate version of BFP (works in RV64)
BFPI2 # immediate version of BFPI (works in RV64, covers 2-32 bit LENs)

however the case should be pretty damn clear.

and, interestingly, sbset is a pseudo-code op for BFP, sbseti is a pseudo-code op for BFPI, and likewise sbclr.  but not sbinv or sbext.

which makes the case for *replacing* sbinv with BFINV and *replacing* sbext with BFEXT

the only reason for keeping sbset/clr/inv would be due to the fact that in the reg-variants, there would potentially be an overhead in setting up rs2, placing a "1" in the LEN field.  however, actually, if you think about it:

(a) there's a bug in bfp: it cannot access 16 bits, it can only access *15* bits.
(b) this may be fixed by adding 1 to len.

if LEN is offset by 1 in BFP, then that implies that as a pseudo-code op for SBSET, you do *not* have to OR in a length field into rs2, you only have to OR in the offset.


so to summarise:

1) SV provides byte-level sequential manipulation partly by accident, partly by design.

2) bitmanip needs to focus on *bit* manipulation, providing *byte* and *word* level manipulation as a side-effect

3) pack needs to go.

4) BFP can be extended and become much more powerful, covering its functionality and offering way more besides, replacing pack as well as the entire single-bit range of opcodes.

5) the only case for retaining sbset (etc.) would be down to the overhead of setting up rs2, however BFP's len may be offset by 1, to compensate.

l.

lkcl

Sep 7, 2019, 6:21:12 AM
to RISC-V ISA Dev, rogier....@gmail.com


On Friday, September 6, 2019 at 7:12:59 AM UTC+1, clifford wrote:
Hi,

reflecting on Rogier's review I am now proposing two new instructions, REPACK[I] and PACKB:

please feel free to comment on github.

with apologies, clifford: as you are no doubt aware, github has an "F" on the GNU hosting criteria:

asking people to use it puts unethical pressure on software libre developers, placing them in an impossible position of having to abandon principles or simply not contribute to the discussion at all.

i trust that it was not your intention - in any way - to force *anyone* into such a position, and that this was simply an oversight.

i will raise this on the code-of-dishonour thread because there has been no response on the suggestion to provide an email-to-github gateway, which would easily solve this problem and allow libre developers to interact and be welcome peer-level contributors.

l.

Rogier Brussee

Sep 8, 2019, 2:48:48 PM
to RISC-V ISA Dev, rogier....@gmail.com


Op zaterdag 7 september 2019 12:16:31 UTC+2 schreef lkcl:


On Friday, September 6, 2019 at 1:08:53 PM UTC+1, Rogier Brussee wrote:


Op woensdag 4 september 2019 23:04:40 UTC+2 schreef lkcl:
Hi, ok it is fantastic to see this review taking place

The bad thing is nobody with more expertise did.

which has me quite concerned.  bitmanip is a complex strategic area.  clifford's original work (0.35, 0.36), including a proper static analysis audit, was superb.  i think it's safe to say that he's getting a little overwhelmed and overloaded.

i'd like to see other people help out, here.
 
 
, it is so long as an inline conversation that I am going to break the inline reply convention and summarise the context.

Regarding the suggestion of extending the adduw/slluw etc to a whopping 82+ operations covering all permutations of 8/16/32 on all src and dest operands:


It is 22 on RV32 , 11 if you drop support for SUB (which is not expensive) and 4 if you also drop the shifted versions and prefer fusion. 

still has me concerned. SV can achieve the packing with SIXTY PERCENT LESS instructions used compared to bitmanip (assuming 32-bit ops). with so little space in RVC, the argument for putting SHFL and PACK into RVC would need to be *really* watertight.  i.e.: unlikely.

and remember that's with 4 8-bit values to pack.  if there were more values to pack, it's the same SV instruction count.

section 4.1.4 (vdraft, probably same as in 0.91):

Clifford has now also come around to the idea that PACKB is a good idea, so packing 4 bytes from 4 different registers a0...a3 into a0 (sign extending the result) now becomes three instructions

PACKB a0 a0 a1
PACKB t0 a2 a3
PACKH a0 a0 t0

where PACKH is PACK on RV32 and PACKW on RV64 or RV128


[snip long example to show that the machinery of Simple V can also be used to pack 4 bytes in a word]

 

1. It completely breaks the RISC paradigm


see my reply to Clifford. It is meant to be a pragmatic solution to the fact that while B and H are second class citizens, some people use (unsigned) shorts and signed bytes anyway, extending how ADDU.W and ADDWU are pragmatic solutions to the fact that people use unsigned ints. 

mmm.... i feel that this is a slippery-slope argument. 

I don't agree but OK. 
 
to explain that, let's take an analogy:

"people want video processing, therefore SIMD is a pragmatic solution to the fact that people do a lot of 8/16-bit bit manipulation".


My proposal has absolutely nothing to do with SIMD; it is really two essentially separate proposals:

1. Supporting signed and unsigned chars, signed and unsigned shorts, signed and unsigned ints and signed and unsigned long longs (and intptr_t), and their interactions with signed and unsigned ints, signed and unsigned long longs and intptr_t, by supporting adds between them, and in the process providing sign extension and zero extension for bytes, shorts, ints and long longs (the zero extension can also be done with PACK). Sign and zero extension will take care of every other operation, because this is where the slope starts. The only possible exception is SUB, because it is so cheap to support. Any other sign and zero extension can be done just fine with the SLLI SR[AL]I trick.
2. Supporting the most common fused shift/add as a regular instruction, because it is easier to implement than fusion, less restrictive in the registers that can be used and/or shorter, and because a single instruction is slightly easier to fuse with a load/store instruction.


While ensuring that the whole ADD package can be densely encoded. 



 [Snip] SV

BitManip is supposed to be about *bit* manipulation, which is why i was so excited by the original 0.35 and 0.36 drafts.  shuffle, grev, clz, pcnt and so on - these are really difficult to do efficiently with byte/word/dword level instructions and so are easily justifiable and worthwhile.

 
Think of bit manipulation as manipulating units that are sub-register-size inside a register.

 
since then: single-bit-set, these are incredibly useful on embedded platforms, particularly when it comes to GPIO set/clear operations.  without these single-bit instructions, you actually have to have special memory-mapped IO registers, where the same underlying I/O latch is mapped *three times* in different locations: one for the "real" value, one for "ORing", one for "CLEARing (&=~xxxx)

likewise: although its encoding is slightly awkward, i can see a strong case for bfp.  again, embedded systems, peripheral management in the linux kernel - being able to drop in a range of bits into a register?  that's really really powerful.
 

*byte* level manipulation on the other hand?  no.  if there are *byte* level instructions being added, here, to me that's indicative that the *bit* level manipulation has either not been thought through or not designed correctly in the first place.
 
example: pack?  mmm... no.  i bet... if bfp was extended to 64 bit, you could actually use bfp instead.  actually... hang on... the layout of bfp is as follows:

The layout of the control word in rs2 is as follows. LEN=0 encodes for LEN=16.
   
| 3                    2                   1                    |
|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
|-------------------------------|-------------------------------|
|       | LEN |       OFF       |              DATA             |
|---------------------------------------------------------------|


my question is: OFFS+LEN is 12 bits.  that will fit into an immediate.  therefore, why is there no bfpi?


Where would you leave "data" inside what is now rs2 ?

 
[snip] Argument that ignores "data" and proposes two immediate instructions 

(a) there's a bug in bfp: it cannot access 16 bits, it can only access *15* bits.

len==0 means 16 bit. 
 
[SNIP]
 
2) bitmanip needs to focus on *bit* manipulation, providing *byte* and *word* level manipulation as a side-effect


I disagree. You could bikeshed, however, that Zbb should be called something like Zba for basic ALU operations. 
 
3) pack needs to go.

I totally disagree. 
 

4) BFP can be extended and become much more powerful, covering its functionality and offering way more besides, replacing pack as well as the entire single-bit range of opcodes.


I don't think that would work, see above.

lkcl

Sep 8, 2019, 5:44:04 PM
to RISC-V ISA Dev, rogier....@gmail.com
On Monday, September 9, 2019 at 2:48:48 AM UTC+8, Rogier Brussee wrote:
> Op zaterdag 7 september 2019 12:16:31 UTC+2 schreef lkcl:
>
>
> On Friday, September 6, 2019 at 1:08:53 PM UTC+1, Rogier Brussee wrote:
>
>
> Op woensdag 4 september 2019 23:04:40 UTC+2 schreef lkcl:Hi, ok it is fantastic to see this review taking place
>
>
> The bad thing is nobody with more expertise did.
>
>
> which has me quite concerned.  bitmanip is a complex strategic area.  clifford's original work (0.35, 0.36), including a proper static analysis audit, was superb.  i think it's safe to say that he's getting a little overwhelmed and overloaded.
>
>
> i'd like to see other people help out, here.
>  
>
>  , it is so long as an inline conversation that I am going to break the inline reply convention and summarise tge context.
> Regarding the suggestion of extending the adduw/slluw etc to a whopping 82+ operations covering all permutations of 8/16/32 on all src and dest operands:
>
>
> It is 22 on RV32 , 11 if you drop support for SUB (which is not expensive) and 4 if you also drop the shifted versions and prefer fusion. 
>
>
> still has me concerned. SV can achieve the packing with SIXTY PERCENT LESS instructions used compared to bitmanip (assuming 32-bit ops. with so little space in RVC, the argument for putting SHFL and PACK into RVC would need to be *really* watertight.  i.e.: unlikely).
>
>
>
> and remember that's with 4 8-bit values to pack.  if there were more values to pack, it's the same SV instruction count.
>
>
> section 4.1.4 (vdraft, probably same as in 0.91):
>
>
>
> Clifford has now also come to the idea that PACKB is a good idea so packing 4 bytes in 4 different registers a0... a3 in a0 (sign extending the result)  now becomes three instructions
>
>
> PACKB a0 a0 a1
> PACKB t0 a2 a3
> PACKH a0 a0 t0

3x32bit, 6x 16bit words.. not bad.
>
> where PACKH is PACK on RV32 and PACKW on RV64 or RV128

Which gives a hierarchical tree O ( log n ) number of instructions depending on pack width.

With SV it's flat. 8 bytes, regardless of pack size.

Of course that is not true if packing variable sized elements, mixed 16 with 32 with 8.

Ok, I like PACKB and PACKW now. I get it.


> 1. It conpletely breaks the RISC paradigm
>
>
> see my reply to Clifford. It is meant to be a pragmatic solution to the fact that while B and H are second class citizens, some people use (unsigned) shorts and signed bytes anyway, extending how ADDU.W and ADDWU are pragmatic solutions to the fact that people use unsigned ints. 

Yep, got it. And that they could be needed a lot due to the RISC-V function call API conventions.

Like it.

>
> mmm.... i feel that this is a slippery-slope argument. 
>
>
> I don't agree but OK. 

It's ok, I get the use case now.



> My proposal has absolutely nothing  to do with SIMD but are two essentially separate proposals for
>
>
> 1. Supporting signed and unsigned chars and signed and unsigned shorts, signed and unsigned ints and signed and unsigned long long's and input_t , and their interactions with signed and unsigned ints and signed and unsigned long longs and intptr_t by supporting add's between them and in the process provide sign extension and zero extension for bytes, shorts, ints, and long long's (the zero extension can also be done with PACK). Sign and zero extension will take care of every other operation because this is where the slope starts. The only possible exception is SUB because it is so cheap to support. Any other sign and zero extension can be done just fine with the SLLI SR[AL]I trick. 

This looks like a description of important use-cases that would need to go into the bitmanip appendix.

Can you think of some examples that illustrate? (i am saying i am having difficulty processing the verbal description)

> 2. Support the most common fused shift/add as a regular instruction because it is easier to implement than fusion, less restrictive  in the registers that can be used and/or shorter, and a single instruction is slightly easier to fuse with a load/store instruction.

Again apologies can you elaborate with some sequences?

>
>
>
> While ensuring that the whole ADD package can be densely encoded. 


>  
> Think of BitManipulation as manipulating units that are sub-register.size inside a register 

Yes. Beyond... yes, sub register.

My point is that SIMD was designed as a way to take standard arithmetic ops down to the byte level (and puts parallelism on top of that)

Where the regfile effectively becomes a typecast array of bytes, or words, or halfwords.

And we know that SIMD is bad (order N^6 opcode proliferation).

If either SV or RVV begin to cover bytelevel and wordlevel, then adding specific opcodes to bitmanip that overlap with their capabilities begins to waste opcode space.

What you said however about the mix and match of PACKB and PACKW means that *in this instance* neither SV nor RVV can cover that efficiently, so the red flag? Hurriedly being hidden sheepishly behind my back and nonchalantly whistling ;)



>
> my question is: OFFS+LEN is 12 bits.  that will fit into an immediate.  therefore, why is there no bfpi?
>
>
>
>
> Where would you leave "data" inside what is now rs2 ?

Sigh yes small brainmelt, i had missed rs2 when looking at the various Rv types. Even if taking the horrible precedent of using funct7 and funct3 for imm that is only 10 bits which is not enough.

So, ignore the bfpi idea.

> (a) there's a bug in bfp: it cannot access 16 bits, it can only access *15* bits.
>
>
> len==0 means 16 bit. 

Let me think that through....

Okaaay so 16 modulo 16 ends up at 0 which is easy math and could be done with AND masking 0xf and gets the desired value.

clever.

Ok I like the bfp encoding, now.

> 2) bitmanip needs to focus on *bit* manipulation, providing *byte* and *word* level manipulation as a side-effect
>
>
>
>
>
> I disagree. You could bikeshed, however, that Zbb should be called something like Zba for basic ALU operations. 

Mmm... will have to reread the sets

>  
>
>
>
> 3) pack needs to go.
>
>
> I totally disagree. 
>  

Yeah me too, now :)

I can see it being really useful if combined with SV, for packing large arrays of bytes/words.

One of the awesome things about SV is that the regfile is typecast. So if there is a regular array of mixed byte/word structs that needs processing or packing, we can throw parallel PACKB and PACKH etc operations at the data.

Can't skip registers unfortunately, they have to be contiguous, or use LD offsets to force the bytes (etc) to be contiguously loaded.

Discussion for another time. Happy with PACKB, PACKH and BFI.

L.

Rogier Brussee

Sep 9, 2019, 2:13:35 PM
to RISC-V ISA Dev, rogier....@gmail.com


Op zondag 8 september 2019 23:44:04 UTC+2 schreef lkcl:
[snip]
 
>
> PACKB a0 a0 a1
> PACKB t0 a2 a3
> PACKH a0 a0 t0

3x32bit, 6x 16bit words.. not bad.
>
> where PACKH is PACK on RV32 and PACKW on RV64 or RV128

Which gives a hierarchical tree O ( log n ) number of instructions depending on pack width.


exactly.
 
[snip] 


> 1. It conpletely breaks the RISC paradigm
>
>
> see my reply to Clifford. It is meant to be a pragmatic solution to the fact that while B and H are second class citizens, some people use (unsigned) shorts and signed bytes anyway, extending how ADDU.W and ADDWU are pragmatic solutions to the fact that people use unsigned ints. 

Yep, got it. And that they could be needed a lot due to the RISC-V function call API conventions.

Like it.



Good!

 
>
> mmm.... i feel that this is a slippery-slope argument. 
>
>
> I don't agree but OK. 

It's ok, I get the use case now.



> My proposal has absolutely nothing  to do with SIMD but are two essentially separate proposals for
>
>
> 1. Supporting signed and unsigned chars and signed and unsigned shorts, signed and unsigned ints and signed and unsigned long long's and input_t , and their interactions with signed and unsigned ints and signed and unsigned long longs and intptr_t by supporting add's between them and in the process provide sign extension and zero extension for bytes, shorts, ints, and long long's (the zero extension can also be done with PACK). Sign and zero extension will take care of every other operation because this is where the slope starts. The only possible exception is SUB because it is so cheap to support. Any other sign and zero extension can be done just fine with the SLLI SR[AL]I trick. 

This looks like a description of important use-cases that would need to go into the bitmanip appendix.

Can you think of some examples that illustrate? (i am saying i am having difficulty processing the verbal description)


char foo[];

void copy_foo(char* buf, unsigned short pos, unsigned short len)
{
    //fools rush in where angels fear to tread
    for (unsigned short i = pos; i < pos + len; ++i)
        buf[i] = foo[i];
}


copy_foo:
   la t0 foo
   addw t1 a1 a2     #two unsigned shorts add to an unsigned _int_; assume they are properly zero extended already
   bgeu a0 t1 endloop
loop:
   add.hu t2 t0 a1   #do the address calculation of foo[i] treating i/a1 as unsigned short!
   lbu t2 0(t2)
   add.hu t3 a0 a1   #do the address calculation of buf[i] treating i/a1 as unsigned short!
   sb  t2 0(t3)
   addwi a1 a1 1     #we don't care about overflow here
   bltu a0 t1 loop
endloop:
   ret
 


Essentially the same problems that you have with unsigned int's on RV64, you have with unsigned shorts and unsigned bytes on RV32. Of course it is not good practice to use unsigned shorts, it just happens anyway. For signed shorts and chars you can invoke undefined behaviour to not care about overflow (and do you want to?). However you need sign/zero extension every time you pass a signed/unsigned short /char by the calling convention. 




 
> 2. Support the most common fused shift/add as a regular instruction because it is easier to implement than fusion, less restrictive  in the registers that can be used and/or shorter, and a single instruction is slightly easier to fuse with a load/store instruction.

Again apologies can you elaborate with some sequences?



this one is easy: 

struct foo **foo;     #a0
size_t n;             #a1
struct foo* foo_n;    #a5

foo_n = foo[n];

addsl3 a5 a0 a1    //assume RV64 so a pointer is 8 byte
ld     a5 0(a5)


> What you said however about the mix and match of PACKB and PACKW means that *in this instance* neither SV nor RVV can cover that efficiently, so the red flag? Hurriedly being hidden sheepishly behind my back and 
> nonchalantly whistling ;) 

:-) 


[snip]
[> 3) pack needs to go.
>
>
> I totally disagree. 
>  

Yeah me too, now :)

I can see it being really useful if combined with SV, for packing large arrays of bytes/words.
 
 
One of the awesome things about SV is that the regfile is typecast. So if there is a regular array of mixed byte/word structs that needs processing or packing, we can throw parallel PACKB and PACKH etc operations at the data.  

PACKB  is boring and so are ADD.HU, SEXTH, or ADDSL3.  I would say that in this case that is a _good_ thing. 
 


[skip]

Discussion for another time. Happy with PACKB, PACKH and BFI.


 
L.

lkcl

Sep 9, 2019, 2:36:41 PM
to RISC-V ISA Dev
Sorry this went under the wrong thread.

gorc seems to simply be the following:

grev t0, t1, rs2
or t0, t0, t1

Is that correct? If so, what significant general purpose savings are there which justify the additional opcode?

If an extremely commonly used algorithm exists for which gorc would reduce code size by significant margins compared to those two opcodes (grev, or) it would be good to list it.

Otherwise gorc is better off as a macro op fusion.

Or, is there something mathematically different about the ORing at each layer where the partial results cascade and you get different answers?

L.

Ray Van De Walker

Sep 9, 2019, 2:59:01 PM
to RISC-V ISA Dev

For a limited set of CRCs, I would have picked the CCITT CRCs myself: CRC32 & CRC16.

Can’t larger CPUs do CRC32C, etc. with the Galois multiplies in the security subset?

So, only very small embedded CPUs need CRC instructions. They often need to communicate, thus the CCITT CRCs.

The applications in the Wikipedia CRC article seem to support this.

 

Jacob Lifshay

Sep 9, 2019, 5:02:23 PM
to Rogier Brussee, RISC-V ISA Dev
On Mon, Sep 9, 2019, 11:13 Rogier Brussee <rogier....@gmail.com> wrote:
Essentially the same problems that you have with unsigned int's on RV64, you have with unsigned shorts and unsigned bytes on RV32. Of course it is not good practice to use unsigned shorts, it just happens anyway. For signed shorts and chars you can invoke undefined behaviour to not care about overflow (and do you want to?). However you need sign/zero extension every time you pass a signed/unsigned short /char by the calling convention.

Do note that there are languages such as Rust (which is being increasingly used for embedded and systems programming as a replacement for C and C++) where signed and unsigned overflow is never undefined behavior -- it is defined to always wrap (release mode) or cause a panic (debug mode) (like a C++ exception but only used for fatal logic errors). Rust also has explicit operations that always wrap even in debug mode.

Jacob

Rogier Brussee

Sep 10, 2019, 1:31:03 AM
to RISC-V ISA Dev, rogier....@gmail.com
[snip]

Op maandag 9 september 2019 20:13:35 UTC+2 schreef Rogier Brussee:
 
[snip: verbatim quote of the copy_foo C and assembler example from my previous message]

I borked the assembler I see. First there is a trivial bug in that I mixed up  a0 (char* buf)  and a1 (unsigned short pos) in the test so it should have been 

copy_foo:
   la t0 foo
   addw t1 a1 a2     #two unsigned shorts add to an unsigned _int_; assume they are properly zero extended already
   bgeu a1 t1 endloop
loop:
   add.hu t2 t0 a1   #do the address calculation of foo[i] treating i/a1 as unsigned short!
   lbu t2 0(t2)
   add.hu t3 a0 a1   #do the address calculation of buf[i] treating i/a1 as unsigned short!
   sb  t2 0(t3)
   addwi a1 a1 1     #we don't care about overflow here
   bltu a1 t1 loop
endloop:
   ret






More importantly though, I walked right into the minefield:

   addwi a1 a1 1     #AARGH we very much _DO_ care about overflow here
   bltu a0 t1 loop   #because t1 may be MAXSHORT + 1

so if you are honest (and I try to be that), for this particular example add.hu has no benefit and it becomes:

loop:
   add t2 t0 a1      #do the address calculation of foo[i] treating i/a1 as size_t because we already zero extended!
   lbu t2 0(t2)
   add t3 a0 a1      #do the address calculation of buf[i] treating i/a1 as size_t because we already zero extended!
   sb  t2 0(t3)
   addwi a1 a1 1
   zexth a1 a1       #we zero extend
   bltu a1 t1 loop   #because we DO care about overflow here
endloop:




The point stands, though, that when indexing with a signed or unsigned short that the compiler cannot tell is normalised (or does not bother to track, because there are too many minefields), add.hu is useful:


unsigned short pos;
char foo[];

unsigned short idx = pos + 1000;
char c = foo[idx];



gives

la t0 foo
addwi t1 a1 1000
add.hu t0 t0 t1
lbu t0 0(t0)


  

Ciao
Rogier

Rogier Brussee

Sep 10, 2019, 1:53:13 AM
to RISC-V ISA Dev, rogier....@gmail.com


Op maandag 9 september 2019 23:02:23 UTC+2 schreef Jacob Lifshay:
On Mon, Sep 9, 2019, 11:13 Rogier Brussee <rogier...@gmail.com> wrote:
Essentially the same problems that you have with unsigned int's on RV64, you have with unsigned shorts and unsigned bytes on RV32. Of course it is not good practice to use unsigned shorts, it just happens anyway. For signed shorts and chars you can invoke undefined behaviour to not care about overflow (and do you want to?). However you need sign/zero extension every time you pass a signed/unsigned short /char by the calling convention.

Do note that there are languages such as Rust (which is being increasingly used for embedded and systems programming as a replacement for C and C++) where signed and unsigned overflow is never undefined behavior -- it is defined to always wrap (release mode)


More reason to have a solution for working with 16- or 8-bit integers that makes them second rather than third class citizens. Note that the Rust approach suggests an add with 16-bit modular arithmetic, i.e. a fuse of

add     rd rs1 rs2
zexth  rd rd

for u16 and  

add rd rs1 rs2
sexth rd rd

for s16

However, one would hope that usage of 16-bit integers in a way that actually hurts comes from old legacy code.
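As a C model of what those fused pairs have to compute (a sketch; the casts rely on the usual truncating conversions, and the function names are just for illustration):

#include <stdint.h>

/* u16 wrapping add: add rd rs1 rs2 ; zexth rd rd */
static inline uint64_t add_u16_wrapping(uint64_t rs1, uint64_t rs2)
{
    return (uint16_t)(rs1 + rs2);
}

/* s16 wrapping add: add rd rs1 rs2 ; sexth rd rd */
static inline int64_t add_s16_wrapping(uint64_t rs1, uint64_t rs2)
{
    return (int16_t)(uint16_t)(rs1 + rs2);
}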

Rogier Brussee

Sep 14, 2019, 4:07:30 PM
to RISC-V ISA Dev


Op maandag 9 september 2019 20:36:41 UTC+2 schreef lkcl:
Sorry this went under the wrong thread.

gorc seems to simply be the following:

grev t0, t1, rs2
or t0, t0, t1

Is that correct?


No it is not, I think. For example grevi rd rs1 7 reverses the bits in each byte.  So if rs1 == 1<<7 then with 

grevi rd rs1 7 #assume rd != rs1
or rd rd rs1

after these instructions rd == (1<<7) + 1

on the other hand

gorc rd rs1 7
 
seems to work on a vector of bytes inside the XLEN-bit register and does the byte-vector version of the 0/-1 "not equal to zero" predicate, so if rs1 == 1<<7, then after the instruction rd == (1<<8) - 1

                
                

If so, what significant general purpose savings are there which justify the additional opcode?

If an extremely commonly used algorithm exists for which gorc would reduce code size by significant margins compared to those two opcodes (grev, or) it would be good to list it.


This means you can use it for doing strlen (and other str* functions) a whole register at a time. E.g.

uint_xlen_t x = ...;

if(gorc(x,7) != -1) 

tests for x containing a zero byte, and

uint_xlen_t x = ...;

n = ctz(~gorc(x,7))  / 8 

finds the index of the first zero byte in x  or is XLEN/8 if there is no zero byte. 
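Spelled out in C, for the byte case (a sketch of my reading of gorc with control 7, using the GCC ctz builtin; not the spec's reference code):

#include <stdint.h>

/* model of gorc(x, 7): OR-combine within each byte, so every nonzero byte
   becomes 0xff and every zero byte stays 0x00 */
static inline uint64_t gorc7(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        if ((x >> (8 * i)) & 0xff)
            r |= 0xffull << (8 * i);
    return r;
}

/* index of the first zero byte, or 8 if there is none */
static inline int first_zero_byte(uint64_t x)
{
    uint64_t m = ~gorc7(x);          /* 0xff exactly in the zero bytes */
    if (m == 0)
        return 8;                    /* RV ctz(0) == XLEN gives this for free */
    return __builtin_ctzll(m) / 8;
}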

lkcl

Sep 14, 2019, 5:27:04 PM
to RISC-V ISA Dev
On Sunday, September 15, 2019 at 4:07:30 AM UTC+8, Rogier Brussee wrote:

> This means you can use it for  doing strlen (and other str* functions)   whole registers at the time. E.g.
>
>
> uint_xlen_t x = ...;
>
>
> if(gorc(x,7) != -1) 
>
>
> tests for x containing a zero byte, and
>
>
>
> uint_xlen_t x = ...;
>
>
> n = ctz(~gorc(x,7))  / 8 
>
>
> finds the index of the first zero byte in x  or is XLEN/8 if there is no zero byte. 

Nice!

That would be good to see in the spec as an example.

Thanks for clarifying, Rogier. I was going to write a mini implementation of the two and compare the output.

L.

Clifford Wolf

Sep 16, 2019, 3:46:51 AM
to Rogier Brussee, RISC-V ISA Dev
Hi,

On Mon, Sep 9, 2019 at 8:13 PM Rogier Brussee <rogier....@gmail.com> wrote:
[..] For signed shorts and chars you can invoke undefined behaviour to not care about overflow (and do you want to?). However you need sign/zero extension every time you pass a signed/unsigned short /char by the calling convention.

That's not entirely true. Because, as you have stated correctly, overflow for signed data types is undefined in C, it is safe for a compiler to not sign-extend in most cases involving signed data types. There's no need to adhere to the calling convention once undefined behavior is triggered. So only fast zero extension is required, which is provided via "pack[w] x, x, zero" for "short" and "andi x, x, 255" for "char". (Other languages that have defined behavior for signed overflow might require two shifts to sign-extend, which may be one instruction on architectures that macro-op-fuse such sequences.)
 
regards,
 - Clifford

Rogier Brussee

Sep 18, 2019, 9:43:40 AM
to RISC-V ISA Dev, rogier....@gmail.com


Op maandag 16 september 2019 09:46:51 UTC+2 schreef clifford:
Hi,

On Mon, Sep 9, 2019 at 8:13 PM Rogier Brussee <rogier...@gmail.com> wrote:
[..] For signed shorts and chars you can invoke undefined behaviour to not care about overflow (and do you want to?). However you need sign/zero extension every time you pass a signed/unsigned short /char by the calling convention.

That's not entirely true. Because, as you have stated correctly, overflow for signed data types is undefined in C, it is safe for a compiler to not sign-extend in most cases involving signed data types. There's no need to adhere to the calling convention once undefined behavior is triggered.

I was just going to uncomfortably agree with you when I realised that I was wrong about the overflow. 

Disclaimer: I am not a language lawyer (and have no ambitions in that direction). However what I wrote below seems to be backed up by


and the references to the standard therein 

There is no undefined behaviour when adding shorts, because arithmetic on shorts first converts them to int and then adds the ints; only at the assignment is the result converted back to short, i.e.

extern short  a,b;
short c = a + b;

is equivalent to 

short c = (short)((int)a + (int)b);

and should compile to something like 

lh a0 %a(gp)   #a0 contains a properly sign extended h, so trivially cast to int
lh a1 %b(gp)   #a1 contains a properly sign extended h, so trivially cast to int

addw a2, a0, a1       #add as ints
slli a2, a2, (-16U)   #sign extend from bit 15
srai a2, a2, (-16U)   #hopefully fused


Of course the compiler is allowed to operate under the "as if" rule if it can prove that sign extension is not needed, for example when storing using sh, and it can leave intermediate results unnormalised. However, it does seem to have to bite the bullet before passing a short to a function and ensure it is properly sign extended, and it cannot, I think, invoke undefined behaviour to avoid that.
 
So only fast zero extension is required, which is provided via "pack[w] x, x, zero" for "short" and "andi x, x, 255" for "char". (Other languages that have defined behavior for signed overflow might require two shifts to sign-extend, which may be one instruction on architectures that macro-op-fuse such sequences.)
 

I still think that some version of addh (either pre sign extending one of the inputs, or post sign extending the result, or both) is the way to go, but at least sign extension for halves and bytes seems to address a real need. Sign extension for other sizes is, I think, properly addressed by addwi, adddi, or the slli srai trick.

See also this proposal by John Hauser 


 
regards,
 - Clifford

lkcl

Sep 18, 2019, 3:06:37 PM
to RISC-V ISA Dev, rogier....@gmail.com


On Wednesday, September 18, 2019 at 2:43:40 PM UTC+1, Rogier Brussee wrote:


Op maandag 16 september 2019 09:46:51 UTC+2 schreef clifford:
Hi,

On Mon, Sep 9, 2019 at 8:13 PM Rogier Brussee <rogier...@gmail.com> wrote:
[..] For signed shorts and chars you can invoke undefined behaviour to not care about overflow (and do you want to?). However you need sign/zero extension every time you pass a signed/unsigned short /char by the calling convention.

That's not entirely true. Because, as you have stated correctly, overflow for signed data types is undefined in C, it is safe for a compiler to not sign-extend in most cases involving signed data types. There's no need to adhere to the calling convention once undefined behavior is triggered.

I was just going to uncomfortably agree with you when I realised that I was wrong about the overflow. 

Disclaimer: I am not a language lawyer (and have no ambitions in that direction). However what I wrote below seems to be backed up by



there is someone on comp.arch who could be tracked down under the huuuuge recent thread "signed/unsigned c++" who really knows his stuff, here.  i believe his name is David Brown.  of all the people you could ask about undefined behaviour in c and c++ he will be able to quote chapter and verse on the c++ standards *and* their historical development.

l.


Bruce Hoult

Sep 18, 2019, 7:46:54 PM
to Rogier Brussee, RISC-V ISA Dev
On Sat, Sep 14, 2019 at 1:07 PM Rogier Brussee <rogier....@gmail.com> wrote:
> gorc rd rs1 7
>
> seems to work on a vector of bytes inside the XLEN register and does the byte vector version of the 0 /-1 version of the not equal predicate, so if rs1 == 1<<7, then after the instruction rd == 1<<8 - 1
>
> This means you can use it for doing strlen (and other str* functions) whole registers at the time. E.g.
>
> uint_xlen_t x = ...;
>
> if(gorc(x,7) != -1)
>
> tests for x containing a zero byte, and
>
> uint_xlen_t x = ...;
>
> n = ctz(~gorc(x,7)) / 8
>
> finds the index of the first zero byte in x or is XLEN/8 if there is no zero byte.

Yes, exactly. And gorc(x,3) finds zero nybbles, gorc(x,1) finds zero
(aligned) pairs of bits, gorc(x,15) finds zero shorts etc. Of course
preconditioning x with an xor or subtract allows you to find any
arbitrary value, not just zero.

Also, on a 64 bit machine gorc(0xab, 7<<3) will create the value
0xabababababababab, and in fact the same result is produced by any
gorc(0xab << (8*i), 7<<3). Similarly gorc(0xabcd, 15<<2) results in
0xabcdabcdabcdabcd, as does gorc(0xabcd0000, 15<<2) or even
gorc(0xbcda000, 15<<2) which allows a repeating 16 bit pattern to be
generated using just an LUI+GORC.
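For the byte-replication case the stages are easy to write down in C (my reading of the gorc stages, masks written out for RV64; a sketch, not the reference code):

#include <stdint.h>
#include <assert.h>

/* gorc with control 7<<3 = 56: OR-combine at distances 8, 16 and 32 */
static inline uint64_t gorc_56(uint64_t x)
{
    x |= ((x & 0x00ff00ff00ff00ffull) << 8)  | ((x & 0xff00ff00ff00ff00ull) >> 8);
    x |= ((x & 0x0000ffff0000ffffull) << 16) | ((x & 0xffff0000ffff0000ull) >> 16);
    x |= (x << 32) | (x >> 32);
    return x;
}

int main(void)
{
    assert(gorc_56(0xab) == 0xababababababababull);
    assert(gorc_56(0xabull << 24) == 0xababababababababull);  /* any byte lane */
    return 0;
}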

If you have GREV already, then GREV+GORC has very low incremental cost.

Small machines that don't want the full cost of GREV+GORC (similar to
a barrel shifter) can implement only the most useful combinations
using simpler circuitry .. for example bit reversal, byte reversal
(endian conversion), zero byte detection .. but use the same opcodes
as full GREVI/GORCI.

Bruce Hoult

Sep 18, 2019, 11:03:08 PM
to Rogier Brussee, RISC-V ISA Dev
On Mon, Sep 2, 2019 at 1:55 PM Rogier Brussee <rogier....@gmail.com> wrote:
> Review of Xbitmanip V0.9
>
> Rogier Brussee
>
> 2019-09-03

A general note: not everything in V0.9 (or V0.91, which was already
current when Rogier wrote this) will make it into the final spec.
Things in V0.91 now are things that appear to be worth *evaluating*
for their potential contribution to code size, execution speed, or
both. That evaluation will probably be negative in at least some
cases. The evaluation is taking place now and will be for probably
several months.

As a personal opinion I expect that we will end up with:

- a "B" extension aimed at being virtually automatically included in
all future application processors, that is those aimed at running full
OSes such as Linux, and currently implementing RV64GC. As such, it
should include only things that provide a tangible benefit for general
code, don't have a large cost in gates, and in particular don't impact
the CPU cycle time -- there's no point in getting a 0.1% decrease in
cycles if your achievable clock speed drops by 1%. I think some things
the current draft shows as being in B will or should fall out of it,
perhaps including such things as pcnt, slo/sro, rol/ror, and
grev/gorc/[un]shfl. "B" might well be seen as too expensive by makers
of microcontrollers.

- a set of Zbxxx extensions that provide a large benefit to at least
some applications, which implementors can pick and choose. Each Zbxxx
extension contains related instructions -- that are synergistic in use
and perhaps in sharing circuitry also. Each Zbxxx extension is
assigned official opcodes so that everyone uses the same ones and
standard toolchains can be shared.


> General remarks:
>
>> Great that the B extension is moving again and making progress. Special thanks to Clifford for pushing this for a long time, must have been an awful lot of work.

Seconded!

I'll try not to comment on things that Clifford already addressed.


> It would be useful if a rationale for the use of an instruction would be given and quantitative data on use were available.

Yes. We also need provenance of each instruction, as Dave Patterson
has done for the base instruction set.


> Comment: I would propose a further criterium:
>
> Support compilers implementing the standard C like language basic data types (pointers, _Bool/bool, char, [un]signed char, [unsigned] short, [unsigned] int, [unsigned] long long) and their basic operations without surprises regarding performance and/or code size, in particular for software written sub optimally or not written for RiscV e.g. using unsigned or short integers.

Agreed. The *wu and *u.w instructions fall into this category and
should eliminate a current RV64 weakness in dealing with unsigned int
loop counters and array indexes in LP64 code.


> Page 4. Section 2.1. Basic Manipulation instructions
>
> Comment: It seems to me that the B extension has moved to “integer instructions that extend or optimise the basic I instructions. I think this is a good thing, and I would argue that the B extension should position as slightly cheaper than the M extension, useful for basic compiler support, standard applications like bit banging in device drivers, networking, and very basic algorithms like hashing.(see also the remark on supporting the C/C++ language). Z extensions can then cater for more specialised purposes.

Agreed. I've been thinking for some time that the scope of B seems to
have expanded beyond "bit manipulation" -- and that this is a good
thing and it's good to include a few other things such as the *wu and
*u.w instructions in it. We should probably rename it, but I'm not
sure to what.


> Maybe it is useful to have a parity instruction even if this can be implemented with pcnt and &1, because it should use less power.

I think that may be worth defining an opcode, yes. Or at least
evaluating. Unary opcodes are cheap.


> Proposal:
>
> We have ctz(n) == pcnt(~n & (n -1)) == pcnt(~ (n |(-n))),
>
> A two register instruction
>
> cntz rd rs1 rs2 : rd = popcount(~ (rs1 | (-rs2)))
>
> Would allow pcnt, ctz and clz to be implemented as 3 macros with at most two instructions instead of 3 instructions
>
>
> pcnt rd rs1 : addi rd rs1 1; cntz rd zero rd
>
> ctz rd rs1 : cntz rd rs1 rs1
>
> clz rd rs1 : brev rd rs1; cntz rd rd rd

Interesting.

The problem is I believe pcnt is significantly more expensive than ctz
or clz and might not make the cut for generic B , while I'd think at
least one of clz/ctz would (and brev too).
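That said, the identity behind cntz is easy to sanity-check with a throwaway C test (nothing authoritative, just a quick check):

#include <stdint.h>
#include <assert.h>

static int popcount64(uint64_t x) { int c = 0; while (x) { c += (int)(x & 1); x >>= 1; } return c; }
static int ctz64(uint64_t x)      { int c = 0; if (!x) return 64; while (!(x & 1)) { c++; x >>= 1; } return c; }

int main(void)
{
    /* ctz(n) == popcount(~n & (n - 1)), with n == 0 giving XLEN on both sides */
    uint64_t tests[] = { 0, 1, 2, 0x80, 0xff00, 1ull << 63, 0x123456789abcdef0ull };
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++)
        assert(ctz64(tests[i]) == popcount64(~tests[i] & (tests[i] - 1)));
    return 0;
}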


> page 5 section 2.1.3. Logic with negate (andn, orn, xnor)
>
>
> Comment: I like that all basic logical operations have a “n” version.
>
> Is nand and nor not also worth it (the RVV mask instructions seem to have them for example) ?

I don't think so. Nand and nor can be done by negating the input of
the operation that uses their result. If both inputs of the dependent AND
or OR need negating then use De Morgan.


> Proposal:
>
> I would propose the following additional logic instructions
>
>
> #and rd rs1 rs2 # rd = rs1 & rs2
>
> andn rd rs1 rs2 # rd = rs1 &~ rs2
>
> nand rd rs1 rs2 # rd = ~(rs1 & rs2)
>
> orn rd rs1 rs2 # rd = ~(rs1 &~ rs2) = ~rs1 | rs2
>
>
>
> and.ne rd rs1 rs2 # rd = (-!!rs1) & rs2 == (!!rs1) ? rs2 : 0
>
> andn.ne rd rs1 rs2 # rd = (-!!rs1) &~ rs2 == (!!rs1) ? ~rs2 : 0
>
> orn.eq rd rs1 rs2 # rd = ~((-!!rs1) & rs2) == (-!rs1) |~ rs2 == (!!rs1) ? ~rs2 : (-1)
>
> or.eq rd rs1 rs2 # rd = ~((-!!rs1) &~ rs2) == (-!rs1) | rs2 == (!!rs1) ? rs2 : (-1)
>
>
> #or rd rs1 rs2 # rd = rs1 | rs2
>
> nor rd rs1 rs2 # rd = ~(rs1 | rs2)
>
> #xor rd rs1 rs2 # rd = rs1 ^ rs2
>
> xnor rd rs1 rs2 # rd = ~(rs1 ^ rs2) == rs1 ^ rs2 ^ (-1) == rs1 ^ ~ rs2
>
>
> This would make AND a 3 bit family (i.e. a basic instruction with some “option flags” or small parameters, here nothing/~ on rs2, nothing/-!! on rs1, nothing/~ on rd) and OR and XOR a 1 bit family of instructions (nothing/~ on rd) that can be placed with the AND OR and XOR instructions themselves (i.e an instructions ANDN, NAND and AND.NE are supposed to have the same func3 values as AND being “AND”-like, but differ in a few bits in the funct7 space ).

My suspicion is this is all going well beyond the point of general
usefulness and benefit.

In fact I have my doubts that anything more than andn is justified by
actual frequency of use in general code. That's not to say we can't
evaluate the others.


> Page 9 Section 2.1.7 SLO/SRO
>
> Comment:
>
> I can see that these instructions are cheap on the alu . However I still don’t really see too much rationale for them and each uses a precious 7 bit immediate.
>
>
> The main application seems to be to construct a mask of imm 1’s i.e. (1 <<imm ) - 1, but the sequence
>
>
> li rd -1 ; srli rd rd (-imm % XLEN)
>
>
> Is just as long if rd is one of the popular registers x8—x15 and can be fused if needed.

I agree.


> While rol and ror are the easiest to explain permutation, they are also the most easily done with just shifts and OR’s. The butterfly network is also not really simpler than that for GREV, i.e slightly expensive. With the below SLLNEG/SRLNEG and more importantly, immediate versions, it is just 3 instructions. MIPS doesn’t have a ROL and ROR instruction. Are ROL and ROR actually worth it?

I agree, the benefit is marginal and only in pretty specialized code,
much of which would be much better accelerated with a proper SHA or
whatever instruction.


> Ternary operators have pushback because they are a burden on the micro architecture, and seem like the perfect candidate to replace by a fusable instruction sequence that can be recommended to compiler people, provided this is feasible.
>
>
>
> In this case two such sequences exist: since
>
>
> (rs1&rs2) | ((~rs1) & rs3) == (rs1 & rs2) ^ ((rs1 ^ (-1)) & rs3)
>
> == (rs1&rs2) ^ (rs1 & rs3) ^ rs3
>
> == (rs1 & (rs2 ^ rs3))) ^ rs3
>
>
> We can write
>
>
> CMIX rd rs1 rs2 rs3 # rs1&rs2 | (~rs1) & rs3, assume rd != rs1, rs3
>
> {
>
> XOR rd rs2 rs3
>
> AND rd rd rs1
>
> XOR rd rd rs3
>
> }
>
>
> and
>
>
> CMIX rd rs1 rs2 rs3 # rs1&rs2 | (~rs1) & rs3, assume rd != rs1, rs2
>
> {
>
> XOR rd rs2 rs3
>
> ANDC rd rd rs1
>
> XOR rd rd rs2
>
> }
>
>
> which gives some freedom for the compiler writer.

I like this a lot.
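To make the equivalence easy to see, a throwaway C sketch (mine) comparing cmix against the XOR/AND/XOR sequence:

#include <assert.h>
#include <stdint.h>

typedef uint32_t uint_xlen_t;   /* XLEN = 32 for this sketch */

static uint_xlen_t cmix(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
    return (rs1 & rs2) | (~rs1 & rs3);      /* direct definition */
}

static uint_xlen_t cmix_seq(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
    uint_xlen_t rd = rs2 ^ rs3;             /* XOR rd, rs2, rs3 */
    rd &= rs1;                              /* AND rd, rd, rs1  */
    return rd ^ rs3;                        /* XOR rd, rd, rs3  */
}

int main(void)
{
    assert(cmix(0xF0F0F0F0u, 0x12345678u, 0x9ABCDEF0u) ==
           cmix_seq(0xF0F0F0F0u, 0x12345678u, 0x9ABCDEF0u));
    return 0;
}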


> Page 29 2.7.2
>
>
> Comment:
>
> We have the same issue with the ternary operator here of course.
>
>
> If we have the AND.NE instruction we can proceed as for the mix operator and define the fusable sequences
>
>
> CMOV rd rs1 rs2 rs3. # rs1 ? rs2 : rs3 assume rd != rs1, rs3
>
> {
>
> XOR rd rs2 rs3, assume rd != rs1, rs3
>
> AND.NE rd rs1 rd
>
> XOR rd rd rs3
>
> }

Ok, I'm warming to your AND.NE. But maybe not all its friends.
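For my own understanding, a throwaway C sketch (mine, not from the review) of how I read and.ne and the sequence built on it:

#include <stdint.h>

typedef uint64_t uint_xlen_t;   /* XLEN = 64 for this sketch */

/* and.ne: rd = rs1 ? rs2 : 0, computed without a branch */
static uint_xlen_t and_ne(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return ((uint_xlen_t)0 - (uint_xlen_t)(rs1 != 0)) & rs2;
}

/* the fusable cmov sequence above; rd must be distinct from rs1 and rs3 */
static uint_xlen_t cmov_seq(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
    uint_xlen_t rd = rs2 ^ rs3;   /* XOR    rd, rs2, rs3 */
    rd = and_ne(rs1, rd);         /* AND.NE rd, rs1, rd  */
    return rd ^ rs3;              /* XOR    rd, rd, rs3  */
}

/* cmov_seq(c, a, b) == (c ? a : b) for all inputs, with no branch */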


> Page 31 Section 2.8 Unsigned address calculation
>
>
> Comment: I like these instructions. Since adds are so dominant in usage and in particular in address calculations in most code they may well turn out the most used instructions of the B extension.

Yes. Which I think justifies the expansion of the definition of B.
Though you could sneak it in under the justification that bit
manipulations are traditionally regarded as being done on bags of bits
aka unsigned numbers, especially shifts.


> Likewise I think there is value in having pointer + integer instruction i.e. an instruction that absorbs the shift like LEA in x86 (one of the most commonly used arithmetic instructions on x86, although that is a bit misleading because it is also the 2 operand add on x86). People have been touting fusion of
>
> SLLI rd rs2 imm
> ADD rd rs1 rd
>
> but note that one cannot use the RVC version of SLLI unless rd == rs2 (the register containing the index) and for basic implementations an “already fused instruction”
>
> ADDSL<k> rd rs1 rs2 # SLLI rd rs2 <k> ; ADD rd rs1 rd
>
> actually seems easier to implement than fusion.

I like it, except:

1) isn't there a real danger that cascading a shift into an add
impacts the cycle time? Yes, some implementations could split it, but
the general RISC-V philosophy is fusion not splitting because this
puts the complexity burden on the already-complex implementations not
on the very simple implementations.

2) in most uses, which is in loops, the whole thing is usually
strength-reduced to ADDI rd, rd, 1<<k anyway.
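To spell out what I mean by strength reduction, a tiny illustrative C example (mine):

#include <stddef.h>

long sum(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];      /* conceptually base + (i << 3), but the compiler
                           strength-reduces this to a pointer that is simply
                           bumped by ADDI p, p, 8 each iteration on LP64 */
    return s;
}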


> Page 32, Section 2.10 Future compressed instructions
>
>
> I would take this section out, or just suggest that C.NOT, C.NEG and C.ZEXT.W and C.ZEXT.D may be worthwhile additions. More quantitative data is needed however, and this just cannot be done just for the B extension. For example I think another instruction that may be worthwhile is
>
>
>
> C.LPC rd —> AUIPC rd 0. # load program counter

I like this, although note that C.JAL with an offset of 2 serves the
same purpose and could if necessary be recognized as a special case.
It would however force saving of RA even in leaf functions that don't
otherwise require it.

lkcl

unread,
Sep 19, 2019, 1:57:30 AM9/19/19
to RISC-V ISA Dev, rogier....@gmail.com


On Thursday, September 19, 2019 at 12:46:54 AM UTC+1, Bruce Hoult wrote:
 
> n = ctz(~gorc(x,7))  / 8
>
> finds the index of the first zero byte in x  or is XLEN/8 if there is no zero byte.

Yes, exactly. And gorc(x,3) finds zero nybbles, gorc(x,1) finds zero
(aligned) pairs of bits, gorc(x,15) finds zero shorts etc. Of course
preconditioning x with an xor or subtract allows you to find any
arbitrary value, not just zero.
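For anyone following along, roughly what the GORC reference does (32-bit, written from memory, so check the spec for the authoritative version) and the zero-byte idiom in C:

#include <stdint.h>

/* my recollection of the 32-bit GORC reference: each enabled stage
   OR-combines adjacent 1/2/4/8/16-bit blocks */
static uint32_t gorc32(uint32_t x, unsigned shamt)
{
    if (shamt &  1) x |= ((x & 0x55555555u) <<  1) | ((x & 0xAAAAAAAAu) >>  1);
    if (shamt &  2) x |= ((x & 0x33333333u) <<  2) | ((x & 0xCCCCCCCCu) >>  2);
    if (shamt &  4) x |= ((x & 0x0F0F0F0Fu) <<  4) | ((x & 0xF0F0F0F0u) >>  4);
    if (shamt &  8) x |= ((x & 0x00FF00FFu) <<  8) | ((x & 0xFF00FF00u) >>  8);
    if (shamt & 16) x |= ((x & 0x0000FFFFu) << 16) | ((x & 0xFFFF0000u) >> 16);
    return x;
}

/* index of the first all-zero byte of x, or 4 if there is none */
static unsigned first_zero_byte(uint32_t x)
{
    uint32_t m = ~gorc32(x, 7);                 /* 0xFF wherever a byte of x is zero */
    return m ? (unsigned)__builtin_ctz(m) / 8   /* __builtin_ctz: GCC/Clang builtin  */
             : 4;
}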

so - really useful, then :)
 
Small machines that don't want the full cost of GREV+GORC (similar to
a barrel shifter) can implement only the most useful combinations
using simpler circuitry .. for example bit reversal, byte reversal
(endian conversion), zero byte detection .. but use the same opcodes
as full GREVI/GORCI.

ah.  right.  the problem with that is: it risks further fragmentation.  what i would suggest instead, is: similar to the way that variable bitshift may be done with an FSM that simply blocks for multiple cycles but otherwise produces the correct answer, an algorithm be researched *and included* in the specification, which provides the *full* identical capability of GREVI/GORCI, is very efficient in terms of minimising gate count, but simply takes far more cycles to complete.

nobody recommends that DIV should only be done on power-of-two or other restricted subsets of operands: they develop FSM-based algorithms (jon dawson's FPU) which take 140 cycles to complete (!!) instead, and are happy (relieved) that they are compliant with the same hardware API as everyone else.

:)

l.

Bruce Hoult

unread,
Sep 19, 2019, 2:25:00 AM9/19/19
to lkcl, RISC-V ISA Dev, Rogier Brussee
On Wed, Sep 18, 2019 at 10:57 PM lkcl <luke.l...@gmail.com> wrote:
>> Small machines that don't want the full cost of GREV+GORC (similar to
>> a barrel shifter) can implement only the most useful combinations
>> using simpler circuitry .. for example bit reversal, byte reversal
>> (endian conversion), zero byte detection .. but use the same opcodes
>> as full GREVI/GORCI.
>
>
> ah. right. the problem with that is: it risks further fragmentation. what i would suggest instead, is: similar to the way that variable bitshift may be done with an FSM that simply blocks for multiple cycles but otherwise produces the correct answer, an algorithm be researched *and included* in the specification, which provides the *full* identical capability of GREVI/GORCI, is very efficient in terms of minimising gate count, but simply takes far more cycles to complete.

Such minimal gate count implementations are going to take longer to
run than simply compiling the C reference implementation in the manual
to the base ISA. You'd be better off to trap and emulate.

Zero nybble/byte/short detection only takes four base ISA instructions
(if you know in advance which one you are doing). That however is
enough to get something like 20% worse Dhrystone benchmark results.
Few people buy CPUs to run Dhrystone on them in production, but sadly
they do choose CPUs by Dhrystone numbers.
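(The four-instruction version I have in mind is the classic trick below, with the two constants assumed preloaded into registers: sub / not (xori -1) / and / and.)

#include <stdint.h>

/* nonzero iff some byte of x is zero */
static uint32_t has_zero_byte(uint32_t x)
{
    return (x - 0x01010101u) & ~x & 0x80808080u;
}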

> nobody recommends that DIV should only be done on power-of-two or other restricted subsets of operands: they develop FSM-based algorithms (jon dawson's FPU) which take 140 cycles to complete (!!) instead, and are happy (relieved) that they are compliant with the same hardware API as everyone else.

This seems like a bad example given how many people do in fact leave
out the M extension and rely on shifts for power-of-two division and
library routines for others.

lkcl

unread,
Sep 19, 2019, 2:30:30 AM9/19/19
to RISC-V ISA Dev, rogier....@gmail.com

On Thursday, September 19, 2019 at 4:03:08 AM UTC+1, Bruce Hoult wrote:
 
A general note: not everything in V0.9 (or V0.91, which was already
current when Rogier wrote this) will make it into the final spec.
Things in V0.91 now are things that appear to be worth *evaluating*
for their potential contribution to code size, execution speed, or
both. That evaluation will probably be negative in at least some
cases. The evaluation is taking place now and will be for probably
several months. 

*sigh* - deep breath: i am aware that there are people who do not want to hear this: our team has not been given an opportunity to make recommendations for inclusion in BitManip, because the process utilises resources where it is *assumed* that they are acceptable to all.

the reasons why we cannot use those resources are not actually relevant: the fact that there is exclusion - discrimination - *at all* is what is not acceptable.

i have in good faith raised this *many times* and have not received a response.

moving on, i will assume that this is being investigated and that a response is being formulated which ensures that inclusory participation in the development and innovation of RISC-V is being taken seriously.

cycles if your achievable clock speed drops by 1%. I think some things
the current draft shows as being in B will or should fall out of it,
perhaps including such things as pcnt, slo/sro, rol/ror, and
grev/gorc/[un]shfl. "B" might well be seen as too expensive by makers
of microcontrollers.

[see previous message about investigation of alternative lower-gate-count but long-cycle blocking FSMs]

ok: PCNT, if excluded, would penalise SV, the scalar-to-parallelism API.  these are opcodes taken *directly* from RVV's "Mask" operations, and PCNT is on the list:

the justification for their inclusion is the *identical* reasoning as to why they were included in RVV.  there's no additions, no variations: the only difference is that whilst RVV uses "zero/non-zero" on vector elements, SV requires *bits* in *scalar* registers to indicate "yes/no".

see below if not familiar with what the significance is - it boils down to the massive difference between sequential if/then operations and parallel predicated ones.
 

The problem is I believe pcnt is significantly more expensive than ctz
or clz and might not make the cut for generic B , while I'd think at
least one of clz/ctz would (and brev too).


again: a blocking version of pcnt can and should be developed, or at least provided as a pseudo-code Reference Guide for embedded implementors.  being able to use the opcode *at all* can save more hassle for the programmer than the exclusion warrants.

given the modular nature of RISC-V, though: yes there's a case for simply excluding them and/or moving them to a sub-extension.


> page 5 section 2.1.3. Logic with negate (andn, orn, xnor)
>
>
> Comment: I like that all basic logical operations have a “n” version.
>
> Is nand and nor not also worth it (the RVV mask instructions seem to have them for example) ?

I don't think so. Nand and nor can be done by negating the input of
the operation using their results. If both inputs of the dependent and
or or need negating then use DeMorgan.


what is the reason why they are in RVV?  what is the key difference, here, that makes for a good justification as to why they should be left out, when RVV has a full 16-permutation of functions?

the answer is down to the interaction with the predicate masks, isn't it?  the predicate masks interact with boolean logic (which i learned at school, and is commonly known as "DeMorgan"), in ways that *prevent* equivalency from being achieved, don't they?

so if those operations are left out, SV, which both parallelises *and* predicates scalar operations, would be penalised, wouldn't it?

> This would make AND a 3 bit family (i.e. a basic instruction with some “option flags” or small parameters, here nothing/~ on rs2,  nothing/-!! on rs1, nothing/~ on rd)   and OR and XOR  a 1 bit family of instructions (nothing/~ on rd) that can be placed with the AND OR and XOR instructions themselves (i.e an instructions  ANDN, NAND and AND.NE are supposed to have the same func3 values as AND being “AND”-like, but differ in a few bits in the funct7 space ).

My suspicion is this is all going well beyond the point of general
usefulness and benefit.


SV needs those operators.  actually, it needs the full identical set to those included in RVV.  the developers of RVV will have spent significant time and resources explaining and justifying their inclusion in RVV, therefore there is no need to go through the same process here.

 
In fact I have my doubts that anything more than andn is justified by
actual frequency of use in general code.

for *scalar* code, yes absolutely, i agree 100%.  however for parallelised and predicated code - where SV provides that in hidden "context" on *TOP* of the BitManip scalar operations (as it does to absolutely all other scalar operations), it doesn't work out.

we have already had to add "predicate bit-negation" to compensate for the fact that you cannot selectively invert the arguments of a series of parallel "Compare" operations.

whilst in sequential code it is perfectly fine to invert the two operands or just change the order of the branch fail/success code, it is *not* fine to do that when it comes to parallelised code.
 
the rest of what you wrote, Bruce, looks reasonable and sensible, and applies equally to SV's benefit as it does to pure-scalar code.

so the question becomes: how may our team's constructive input - representing the needs of the 3D and Video Processing Hybrid CPU / GPU / VPU Community, behind the Open 3D Graphics Alliance that will be presented at the December RISC-V Conference - be respected and included?

l.

lkcl

unread,
Sep 19, 2019, 2:49:37 AM9/19/19
to RISC-V ISA Dev, luke.l...@gmail.com, rogier....@gmail.com


On Thursday, September 19, 2019 at 7:25:00 AM UTC+1, Bruce Hoult wrote:
 
> ah.  right.  the problem with that is: it risks further fragmentation.  what i would suggest instead, is: similar to the way that variable bitshift may be done with an FSM that simply blocks for multiple cycles but otherwise produces the correct answer, an algorithm be researched *and included* in the specification, which provides the *full* identical capability of GREVI/GORCI, is very efficient in terms of minimising gate count, but simply takes far more cycles to complete.  
Such minimal gate count implementations are going to take longer to
run than simply compiling the C reference implementation in the manual
to the base ISA. You'd be better off to trap and emulate.


bruce: please.  i appreciate that there is a habit that some people take with me, of coming up with a "contra-case", just because it's me.  please, do let's nip that in the bud.

the idea that you propose, to leave it to "traps", would be a good one, if it weren't for the exclusion of the possibility of "intermediate" implementations which are half way between the two extremes.

this applies as much to the "simple" DIV algorithm (and to augmented versions of the same algorithm which have a RADIX not equal to 1) as it does to GORC (etc.)

in the Libre RISC-V SOC's ALU, what we've done is to design all of the "processing" as combinatorial blocks, where even *at runtime* there are dynamic latches that open up and halve (or double) the pipeline length.

similar techniques may be applied even to FSMs to halve the completion time, obviously (in the case of an FSM) doubling the number of gates in the process.  repeat the process (double again, halve the time again) and you have a "crude but fast and effective" way to give something that falls in between the two extremes, with the implementor having full control over the decision-making process.

we used a similar technique when working for Aspex Semiconductors on their 4096-parallel 2-bit-ALU system, by opening up (or closing) the gates between the ALUs, and substituting doubling of the string length for halving the completion time.

a "more advanced" implementor would of course (taking the DIV example again) just develop a RADIX-2 or RADIX-4 or RADIX-8 FSM rather than a single-bit FSM, yet, here, again, they would have the choice of which point to stop an [additional] doubling/halving and also the choice over which RADIX, in order to meet customer demand, on-demand.

now, can you predict *exactly* what efficient GORC (etc.) algorithms innovative implementors will deploy in the future?  no, of course not.  should they be *excluded* or prevented and prohibited from even *bothering* to do such innovation, purely on the basis that we cannot predict what they might or might not do?  that does not seem fair - or wise.

so what i'm saying is: it's a bit more nuanced.  and... *sigh*... means more work for the extension developers... sorry!

l.

Matthew Farkas-Dyck

unread,
Sep 19, 2019, 2:55:08 AM9/19/19
to Bruce Hoult, Rogier Brussee, RISC-V ISA Dev
On 18/09/2019, Bruce Hoult <bruce...@sifive.com> wrote:
> As a personal opinion I expect that we will end up with:
> I've been thinking for some time that the scope of B seems to
> have expanded beyond "bit manipulation" -- and that this is a good
> thing and it's good to include a few other things such as the *wu and
> *u.w instructions in it. We should probably rename it, but I'm not
> sure to what.

I strongly disagree it's good the scope of B expanded so far, and
ergo, strongly agree we should rename it ☺

I saw someone said we might want to omit grev and rol/ror — so we
claim we're defining the bit-twiddling extension, but omitting most
bit-twiddling. Which seems patently ridiculous. I think the core set
of bit-twiddling primitives — in particular: grev, shfl, rol/ror, clz,
ctz, pcnt — is very well-considered and would hate to lose it to a
grab-bag extension which will impose an undue burden on specialized
cores for bit-twiddling-heavy code.

I'm not averse to an extension to speed up various common operations
(e.g. (add, sub)wu, (add, sub, slli)u.w, pack), which can of course
overlap with B, but we ought to design it as such. I propose the name
U for this extension, as it can be considered to mean the "utility"
extension, and is phonetically the back-vowel counterpart of I.

Jacob Lifshay

unread,
Sep 19, 2019, 2:57:39 AM9/19/19
to Matthew Farkas-Dyck, Bruce Hoult, Rogier Brussee, RISC-V ISA Dev
I think U is already taken for user-mode.

Jacob

Jacob Lifshay

unread,
Sep 19, 2019, 3:02:19 AM9/19/19
to Matthew Farkas-Dyck, Bruce Hoult, Rogier Brussee, RISC-V ISA Dev
I'm not averse to an extension to speed up various common operations
(e.g. (add, sub)wu, (add, sub, slli)u.w, pack), which can of course
overlap with B, but we ought to design it as such. I propose the name
U for this extension, as it can be considered to mean the "utility"
extension, and is phonetically the back-vowel counterpart of I.
I think U is already taken for user-mode.

maybe Y for yet-more-int-ops? :P

Jacob

Rogier Brussee

unread,
Sep 19, 2019, 3:32:19 AM9/19/19
to RISC-V ISA Dev, rogier....@gmail.com


On Thursday, September 19, 2019 at 05:03:08 UTC+2, Bruce Hoult wrote:
On Mon, Sep 2, 2019 at 1:55 PM Rogier Brussee <rogier...@gmail.com> wrote:
> Review of Xbitmanip V0.9
>
> Rogier Brussee
>
> 2019-09-03

A general note: not everything in V0.9 (or V0.91, which was already
current when Rogier wrote this)

Yeah, but not when I started the review :-).
 
will make it into the final spec.
Things in V0.91 now are things that appear to be worth *evaluating*
for their potential contribution to code size, execution speed, or
both. That evaluation will probably be negative in at least some
cases. The evaluation is taking place now and will be for probably
several months.


Very good!
 
As a personal opinion I expect that we will end up with:

- a "B" extension aimed at being virtually automatically included in
all future application processors, that is those aimed at running full
OSes such as Linux, and currently implementing RV64GC. As such, it
should include only things that provide a tangible benefit for general
code, don't have a large cost in gates, and in particular don't impact
the CPU cycles time -- there's no point in getting a 0.1% decrease in
cycles if your achievable clock speed drops by 1%. I think some things
the current draft shows as being in B will or should fall out of it,
perhaps including such things as pcnt, slo/sro, rol/ror, and
grev/gorc/[un]shfl. "B" might well be seen as too expensive by makers
of microcontrollers.

 
I agree, except that per Luke's comment there is enough value in not splintering the ISA, and such a need to have byte swap (and to a lesser extent bit reversal), that I think grevi (or the functionally equivalent "grevi with negative powers of two", where bswap and brev correspond to one step in the network) is basically required.
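For reference, the two cases I have in mind written out in plain C (32-bit, my own sketch; the grevi shift amounts are how I read the spec):

#include <stdint.h>

static uint32_t bswap32(uint32_t x)     /* byte swap: grevi x, 24 on RV32, as I read it */
{
    return (x << 24) | ((x & 0x0000FF00u) << 8) |
           ((x >> 8) & 0x0000FF00u) | (x >> 24);
}

static uint32_t brev32(uint32_t x)      /* full bit reversal: grevi x, 31 */
{
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);   /* swap adjacent bits */
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);   /* swap 2-bit pairs   */
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);   /* swap nibbles       */
    return bswap32(x);                                         /* then swap bytes    */
}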
 
- a set of Zbxxx extensions that provide a large benefit to at least
some applications, which implementors can pick and choose. Each Zbxxx
extension contains related instructions -- that are synergistic in use
and perhaps in sharing circuitry also. Each Zbxxx extension is
assigned official opcodes so that everyone uses the same ones and
standard toolchains can be shared.


> General remarks:
>
>> Great that the B extension is moving again and making progress. Special thanks to Clifford for pushing this for a long time, must have been an awful lot of work.

Seconded!

I'll try not to comment on things that Clifford already addressed.



OK. Thanks for having a look!
 
> It would be useful if  a rationale for the use of an instruction would be given and quantitative data on use were available.

Yes. We also need provenance of each instruction, as Dave Patterson
has done for the base instruction set.


> Comment: I would propose a further criterium:
>
> Support compilers implementing the standard C like language basic data types (pointers, _Bool/bool, char, [un]signed char, [unsigned] short, [unsigned] int, [unsigned] long long) and their basic operations without surprises regarding performance and/or code size, in particular for software written sub optimally or not written for RiscV e.g. using unsigned or short integers.

Agreed. The *wu and *u.w instructions fall into this category and
should eliminate a current RV64 weakness in dealing with unsigned int
loop counters and array indexes in LP64 code.



Good!
 
> Page 4. Section 2.1. Basic Manipulation instructions
>
> Comment: It seems to me that the B extension has moved to “integer instructions that extend or optimise the basic I instructions. I think this is a good thing, and I would argue that the B extension should position as  slightly cheaper than the M extension, useful for basic compiler support, standard applications like bit banging in device drivers, networking, and very basic algorithms like hashing.(see also the remark on supporting the C/C++ language). Z extensions can then cater for more specialised purposes.

Agreed. I've been thinking for some time that the scope of B seems to
have expanded beyond "bit manipulation" -- and that this is a good
thing and it's good to include a few other things such as the *wu and
*u.w instructions in it. We should probably rename it, but I'm not
sure to what.



How about Zba: Basic ALU instructions?
 
> Maybe it is useful to have a parity instruction even if this can be implemented with pcnt and &1, because it should  use less power.

I think that may be worth defining an opcode, yes. Or at least
evaluating. Unary opcodes are cheap.



Agreed. 

Wild idea: what would be the use of a GXORC[I], an xor-combine similar to GORC? Clifford suggested CLMUL for such cases, but that is not in Zbb (aka Zba).
 
 
> Proposal:
>
> We have ctz(n) == pcnt(~n & (n -1)) == pcnt(~ (n |(-n))),
>
> A two register instruction
>
> cntz rd rs1 rs2 : rd = popcount(~ (rs1 | (-rs2)))
>
> Would allow pcnt, ctz and clz to be implemented as 3 macros with at most two instructions instead of 3 instructions
>
>
> pcnt rd rs1 : addi rd rs1 1; cntz rd zero rd
>
> ctz   rd rs1 : cntz rd rs1 rs1
>
> clz   rd rs1 : brev rd rs1; cntz rd rd rd

Interesting.

The problem is I believe pcnt is significantly more expensive than ctz
or clz and might not make the cut for generic B , while I'd think at
least one of clz/ctz would (and brev too).


I didn't know this. But pcnt still seems pretty standard to have. 
 
> page 5 section 2.1.3. Logic with negate (andn, orn, xnor)
>
>
> Comment: I like that all basic logical operations have a “n” version.
>
> Is nand and nor not also worth it (the RVV mask instructions seem to have them for example) ?

I don't think so. Nand and nor can be done by negating the input of
the operation using their results. If both inputs of the dependent and
or or need negating then use DeMorgan.



Obviously. The idea was that it would remove the need for a C.not and the hw cost would be dominated by instruction decoding anyway.
 

> Proposal:
>
> I would propose the following additional logic instructions
>
>
> #and rd rs1 rs2 # rd = rs1 & rs2
>
> andn rd rs1 rs2 # rd = rs1 &~ rs2
>
> nand rd rs1 rs2 # rd = ~(rs1 & rs2)
>
> orn  rd rs1 rs2   # rd = ~(rs1 &~ rs2) = ~rs1 | rs2
>
>
>
> and.ne     rd rs1 rs2 # rd = (-!!rs1) &  rs2 == (!!rs1) ? rs2 : 0
>
> andn.ne   rd rs1 rs2 # rd = (-!!rs1) &~ rs2 == (!!rs1) ? ~rs2 : 0
>
> orn.eq      rd rs1 rs2 # rd = ~((-!!rs1) &  rs2) == (-!rs1)  |~ rs2 == (!!rs1) ? ~rs2 : (-1)
>
> or.eq        rd rs1 rs2 # rd = ~((-!!rs1) &~ rs2) == (-!rs1)  |  rs2 == (!!rs1) ? rs2    : (-1)
>
>
> #or           rd rs1 rs2 #  rd = rs1 | rs2
>
> nor rd rs1 rs2 # rd = ~(rs1   | rs2)
>
> #xor          rd rs1 rs2 # rd = rs1 ^ rs2
>
> xnor rd rs1 rs2 # rd = ~(rs1 ^ rs2)  == rs1 ^ rs2 ^ (-1) == rs1 ^ ~ rs2
>
>
> This would make AND a 3 bit family (i.e. a basic instruction with some “option flags” or small parameters, here nothing/~ on rs2,  nothing/-!! on rs1, nothing/~ on rd)   and OR and XOR  a 1 bit family of instructions (nothing/~ on rd) that can be placed with the AND OR and XOR instructions themselves (i.e an instructions  ANDN, NAND and AND.NE are supposed to have the same func3 values as AND being “AND”-like, but differ in a few bits in the funct7 space ).

My suspicion is this is all going well beyond the point of general
usefulness and benefit.

In fact I have my doubts that anything more than andn is justified by
actual frequency of use in general code. That's not to say we can't
evaluate the others.


You may very well be right. The question is whether it gives compiler writers extra wiggle room.
:-)
 

> Page 29 2.7.2
>
>
> Comment:
>
> We have the same issue with the  ternary operator here of course.
>
>
> If we have the AND.NE instruction we can proceed as for the mix operator and define the fusable sequences
>
>
> CMOV rd rs1 rs2 rs3. # rs1 ? rs2 : rs3 assume rd != rs1, rs3
>
> {
>
> XOR rd rs2 rs3, assume rd != rs1, rs3
>
> AND.NE rd rs1 rd
>
> XOR rd rd rs3
>
> }

Ok, I'm warming to your AND.NE. But maybe not all its friends.


:-)
 

> Page 31 Section 2.8 Unsigned address calculation
>
>
> Comment: I like these instructions. Since adds are so dominant in usage and in particular in address calculations in most code  they may well turn out the most used instructions of the B extension.

Yes. Which I think justifies the expansion of the definition of B.
Though you could sneak it in under the justification that bit
manipulations are traditionally regarded as being done on bags of bits
aka unsigned numbers, especially shifts.


It is all about sign extension,  a bit manipulation thing :-)


> Likewise I think there is value in having pointer + integer instruction  i.e. an instruction that absorbs the shift like LEA in x86 (one of the most commonly used arithmetic instructions on x86, although that is a bit misleading because it is also the 2 operand add on x86). People have been touting fusion of
>
> SLLI rd rs2 imm
> ADD rd rs1 rd
>
> but note that one cannot use the RVC version of SLLI unless rd == rs2 (the register containing the index) and for basic implementations an “already fused instruction”
>
>  ADDSL<k> rd rs1 rs2 # SLLI rd rs2 <k> ;  ADD rd rs1 rd
>
> actually seems easier to implement than fusion.

I like it, except:

1) isn't there a real danger that cascading a shift into an add
impacts the cycle time? Yes, some implementations could split it, but
the general RISC-V philosophy is fusion not splitting because this
puts the complexity burden on the already-complex implementations not
on the very simple implementations.


I can see both arguments, but the shift is never more than 3 stages (i.e. up to 7), and it may be entirely reasonable to only have 2-stage shifts of 0, 1, 2, 3 for RV32; my assumption was that this should leave enough time for an add. Finally, while fusion is a great optimisation and simplification technique, there are two cases where fusion is being touted as essentially being part of the ISA: this one, and the slli/srai trick for sign extension from bytes and halves (zero extension being already taken care of). In those cases I think an instruction which is slightly more flexible, doesn't force the complexity of fusion on simple implementations, and provides natural further fusion opportunities is actually better.

2) in most uses, which is in loops, the whole thing is usually
strength-reduced to ADDI rd, rd, 1<<k anyway.



Yes there is good reason why it is not in I. 
 
> Page 32, Section 2.10 Future compressed instructions
>
>
> I would take this section out, or just suggest that C.NOT, C.NEG and C.ZEXT.W and C.ZEXT.D may be worthwhile additions.  More quantitative data is needed however, and this just cannot be done just for the B extension. For example I think another instruction that may be worthwhile is
>
>
>
> C.LPC rd —> AUIPC rd 0. # load program counter

I like this, although note that C.JAL with an offset of 2 serves the
same purpose and could if necessary be recognized as a special case.
It would however force saving of RA even in leaf functions that don't
otherwise require it.

Yes. Also C.JAL is only available on RV32. The main use case would be PC relative addressing: something like

L1:
C.LPC a5 # load PC-relative constant
lw a5, (L2 - L1)(a5)

...

L2:
.word 0xabcd
 
Yours
Rogier

Ben Marshall

unread,
Sep 19, 2019, 4:25:58 AM9/19/19
to Rogier Brussee, RISC-V ISA Dev


> While rol and ror are the easiest to explain permutation, they are also the most easily done with just shifts and OR’s. The butterfly network is also not really simpler than that for GREV, i.e slightly expensive.  With the below SLLNEG/SRLNEG  and more importantly,  immediate versions, it is just 3 instructions.  MIPS doesn’t have a ROL and ROR instruction. Are ROL and ROR actually worth it?

I agree, the benefit is marginal and only in pretty specialized code,
much of which would be much better accelerated with a proper SHA or
whatever instruction.


I have a few overlapping comments about ROR/ROL.

- These are essential to cryptographic code. Without them, you cannot be
  performant for the common standardised block ciphers or hash functions.

- RISC-V will/must have a rotate instruction. Clearly there is discussion
  to be had about whether it lives in bitmanip or in the Crypto extension.

- I guess one can argue that cryptographic workloads are "specialized".
  That said, I believe that they are important enough and frequently
  executed enough to mean that they deserve special attention.

  - Linux uses ChaCha20 as its CRNG implementation, which suffers very
    heavily without a rotate instruction. [1,3,4]

- A shift/or based rotate is between 48 and 96 bits of instruction data to
  fetch. As opposed to (assuming no compressed version) 32-bits of
  rotate. The energy efficiency impact on crypto code for having to
  execute three instructions rather than one will not be good. I don't
  have numbers for this, but I'd be keen to see the energy efficiency
  impact of favouring shift/or over rotate.

- A fused shift/or implementation requires an extra working register.

- The Crypto Task Group have discussed sharing instructions between crypto
  and bitmanip. E.g. if one wanted to implement the crypto extension and not
  bitmanip, you could still used the ROR/ROL instructions as defined in
  bitmanip, to save crypto defining it's own ROR/ROL.

- I don't think one can argue against ROR/ROL by saying any of
  their use cases are better catered too by a dedicated accelerator.
  Such accelerators lack flexibility, and are too expensive for many cores
  or SoCs where ROR/ROL would be most useful. I.e. at the smaller end of
  things.

- Even the most basic ARM micro-controller has rotate functionality.
  Many of the new NIST lightweight cryptography[2] candidates
  provide reference code for ARM. Any attempt to port these to RISC-V
  will immediately show that it's lack of a rotate instruction severely
  impacts their performance. This has already been noted by our research
  group and others[3].
 
  - On a very specialised but also important note: when writing power/EM
    side channel resistant code, ROR/ROL has a useful property in that
    it does not change the hamming weight of its inputs. This is used in
    the security arguments for some classes of ARX (Add, rotate xor)
    block cipher. Having to emulate with shift/or will absolutely change
    the hamming weight of the inputs. The NIST competition explicitly
    asks for this sort of consideration of side-channel resilience
    from the candidates.

Cheers,
Ben

1. https://en.wikipedia.org/wiki/Salsa20#ChaCha_variant
2. https://csrc.nist.gov/Projects/Lightweight-Cryptography
3. https://eprint.iacr.org/2019/794.pdf
4. https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/char/random.c?h=v5.2.16#n835

lkcl

unread,
Sep 19, 2019, 4:33:43 AM9/19/19
to RISC-V ISA Dev, rogier....@gmail.com
On Thursday, September 19, 2019 at 9:25:58 AM UTC+1, Ben Marshall wrote:
 
I have a few overlapping comments about ROR/ROL.
 
- I guess one can argue that cryptographic workloads are "specialized".

  That said, I believe that they are important enough and frequently
  executed enough to mean that they deserve special attention.

you would be amazed at how ineffective googling "uses for ROR instruction" is at finding answers, there :)

this is all i've been able to find, so far: 

"ROL and ROR are useful for examining a value bit-by-bit in a way that is (ultimately) non-destructive. They can also be used to shunt a bitmask around without bringing in garbage bits."

"Many algorithms in graphics and cryptography use rotation, and their inclusion in CPUs makes it possible to write very fast algorithms in assembly."

"Pretty arcane and rarely used? Really? There are many places where rotation is useful, especially in, as you say hashing and cryptography. On many CPU's where the amount shifted affects time, it's actually faster to rotate and bitwise and rather than doing a shift."

"Yes, very rarely used. Hashing and crypto are things to be used from libraries, not something every developer should write for themselves."

l.

Jacob Lifshay

unread,
Sep 19, 2019, 4:53:22 AM9/19/19
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Rogier Brussee
additional uses for rotate instructions: the default fast/insecure (but not the overall default, which is a cryptographically secure PRNG) random number generator in Rust (`SmallRng` in the rand crate, currently a pcg algorithm) for 64-bit platforms uses rotate instructions to help decorrelate the generated numbers:
see definition of SmallRng in rand 0.7.2:

sidenote: if you haven't heard of the pcg class of random number generators, they're very interesting -- much smaller (both code and data), faster, and better than even mersenne twister.
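From memory (the constants may be slightly off, see pcg-random.org for the real thing), the core PCG-XSH-RR step looks roughly like this; the last line is exactly the variable rotate being discussed:

#include <stdint.h>

static uint64_t pcg_state = 0x853c49e6748fea9bULL;   /* arbitrary seed */

static uint32_t pcg32(void)
{
    uint64_t old = pcg_state;
    pcg_state = old * 6364136223846793005ULL + 1442695040888963407ULL;  /* LCG step   */
    uint32_t xorshifted = (uint32_t)(((old >> 18) ^ old) >> 27);        /* XSH        */
    uint32_t rot = (uint32_t)(old >> 59);                               /* top 5 bits */
    return (xorshifted >> rot) | (xorshifted << ((32 - rot) & 31));     /* RR: rotate */
}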

Jacob Lifshay

Sober Liu

unread,
Sep 19, 2019, 5:29:35 AM9/19/19
to Jacob Lifshay, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Rogier Brussee

I see Dhrystone was discussed today, although Dhrystone results are hard to judge (e.g., ensuring equivalent compiler options between armcc and arm-gcc).

What is the expected score for Coremark? So far I haven't seen it mentioned in this thread.

 

Thanks.


Rogier Brussee

unread,
Sep 19, 2019, 10:43:13 AM9/19/19
to RISC-V ISA Dev, rogier....@gmail.com


On Thursday, September 19, 2019 at 10:25:58 UTC+2, Ben Marshall wrote:


> While rol and ror are the easiest to explain permutation, they are also the most easily done with just shifts and OR’s. The butterfly network is also not really simpler than that for GREV, i.e slightly expensive.  With the below SLLNEG/SRLNEG  and more importantly,  immediate versions, it is just 3 instructions.  MIPS doesn’t have a ROL and ROR instruction. Are ROL and ROR actually worth it?

I agree, the benefit is marginal and only in pretty specialized code,
much of which would be much better accelerated with a proper SHA or
whatever instruction.



It was very much a question. Your opinion is much better informed than mine. Thanks for joining the discussion.
 
I have a few overlapping comments about ROR/ROL.

- These are essential to cryptographic code. Without them, you cannot be
  performant for the common standardised block ciphers or hash functions.

- RISC-V will/must have a rotate instruction. Clearly there is discussion
  to be had about whether it lives in bitmanip or in the Crypto extension.

- I guess one can argue that cryptographic workloads are "specialized".
  That said, I believe that they are important enough and frequently
  executed enough to mean that they deserve special attention.

  - Linux uses ChaCha20 as its CRNG implementation, which suffers very
    heavily without a rotate instruction. [1,3,4]


I cannot really judge how relevant this very focussed micro benchmark within the kernel's use of CRNG is, but it clearly takes a (surprisingly big) hit.

- A shift/or based rotate is between 48 and 96 bits of instruction data to
  fetch.

it seems hard to get 48 bit, but 64 bit is doable (with an immediate and RVC)

srli a5, a0, (-imm)
C.slli a0, a0, imm
C.or a0, a0, a5
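(In C source that is just the usual rotate idiom, which gcc/clang typically turn into exactly such a shift/shift/or sequence when no rotate instruction is available:)

#include <stdint.h>

/* rotate right by r, without UB for any r: both shift amounts stay in 0..31 */
static inline uint32_t rotr32(uint32_t x, unsigned r)
{
    return (x >> (r & 31)) | (x << ((32 - r) & 31));
}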
 
As opposed to (assuming no compressed version) 32-bits of
  rotate. The energy efficiency impact on crypto code for having to
  execute three instructions rather than one will not be good. I don't
  have numbers for this, but I'd be keen to see the energy efficiency
  impact of favouring shift/or over rotate. 
 

Sounds like valuable data to obtain.
 
- A fused shift/or implementation requires an extra working register.


Yes, which makes the pattern unfusable. The last shift/or pair could be fused, however.
 
- The Crypto Task Group have discussed sharing instructions between crypto
  and bitmanip. E.g. if one wanted to implement the crypto extension and not
  bitmanip, you could still used the ROR/ROL instructions as defined in
  bitmanip, to save crypto defining it's own ROR/ROL 

- I don't think one can argue against ROR/ROL by saying any of
  their use cases are better catered too by a dedicated accelerator.
  Such accelerators lack flexibility, and are too expensive for many cores
  or SoCs where ROR/ROL would be most useful. I.e. at the smaller end of
  things.

- Even the most basic ARM micro-controller has rotate functionality.
  Many of the new NIST lightweight cryptography[2] candidates
  provide reference code for ARM. Any attempt to port these to RISC-V
  will immediately show that it's lack of a rotate instruction severely
  impacts their performance. This has already been noted by our research
  group and others[3].

And so does x86. Wide availability most certainly encourages their use.
I wanted to point out, however, that even the latest update of the MIPS instruction set does not include them.

    
  - On a very specialised but also important note: when writing power/EM
    side channel resistant code, ROR/ROL has a useful property in that
    it does not change the hamming weight of its inputs. This is used in
    the security arguments for some classes of ARX (Add, rotate xor)
    block cipher. Having to emulate with shift/or will absolutely change
    the hamming weight of the inputs. The NIST competition explicitly
    asks for this sort of consideration of side-channel resilience
    from the candidates.

 
From the security point of view,
 
  • Do you think a security oriented ROR/ROL imposes special implementation restrictions? 
  • Do you see value in a ROL instruction (which is ROR with -rs2)  if ROR is provided?
  • Do you see value in the CLMUL  and BMATXOR instructions?  Does security impose special implementation considerations for these instructions?
  • Do you see bitmanip-like primitives you feel would be very useful that have been missed ? 

Ciao

Rogier

lkcl

unread,
Sep 20, 2019, 12:19:06 AM9/20/19
to RISC-V ISA Dev, rogier....@gmail.com
Also I haven't seen any mention of video encode and decode examples. I would anticipate that at the very least, BitManip would be used for efficient YUV RGB conversion, across a range of bit widths on either side (RGB8888, RGB565 etc).

Whilst a case can be made for doing video as a hardware accelerated block, the fact that recent ARM64 processors can do virtually any 720p modern and older algorithm with NEON only is extremely compelling.

With video decode being such a huge market the case for its investigation and inclusion *prior* to finalisation of BitManip should be really clear.

What is the process by which this may be discussed, in a way that is inclusive to all stakeholders?

L.

Rogier Brussee

unread,
Sep 20, 2019, 4:29:02 AM9/20/19
to RISC-V ISA Dev, rogier....@gmail.com


On Friday, September 20, 2019 at 06:19:06 UTC+2, lkcl wrote:
Also I haven't seen any mention of video encode and decode examples. I would anticipate that at the very least, BitManip would be used for efficient YUV RGB conversion, across a range of bit widths on either side (RGB8888, RGB565 etc).

If nothing else, that sounds like a great real world testcase to me. You and Jacob seem to have built up considerable expertise in this area.  Maybe you have a little C (or Rust) program that does the essential step that you can share? If you could also indicate how you think it would efficiently  compile  to the current IMC +  Bitmanip (or conversely why you fail to have an efficient solution), that would be very useful too, I think. I am sure Clifford or others can chime in if things can be done more efficiently. 

Ciao
Rogier

Ben Marshall

unread,
Sep 20, 2019, 4:36:10 AM9/20/19
to Rogier Brussee, RISC-V ISA Dev


On 19/09/2019 15:43, Rogier Brussee wrote:


On Thursday, September 19, 2019 at 10:25:58 UTC+2, Ben Marshall wrote:


> While rol and ror are the easiest to explain permutation, they are also the most easily done with just shifts and OR’s. The butterfly network is also not really simpler than that for GREV, i.e slightly expensive.  With the below SLLNEG/SRLNEG  and more importantly,  immediate versions, it is just 3 instructions.  MIPS doesn’t have a ROL and ROR instruction. Are ROL and ROR actually worth it?

I agree, the benefit is marginal and only in pretty specialized code,
much of which would be much better accelerated with a proper SHA or
whatever instruction.



It was very much a question. Your opinion is much better informed than mine. Thanks for joining the discussion.
 
I have a few overlapping comments about ROR/ROL.

- These are essential to cryptographic code. Without them, you cannot be
  performant for the common standardised block ciphers or hash functions.

- RISC-V will/must have a rotate instruction. Clearly there is discussion
  to be had about whether it lives in bitmanip or in the Crypto extension.

- I guess one can argue that cryptographic workloads are "specialized".
  That said, I believe that they are important enough and frequently
  executed enough to mean that they deserve special attention.

  - Linux uses ChaCha20 as its CRNG implementation, which suffers very
    heavily without a rotate instruction. [1,3,4]


I cannot really judge how relevant this very focussed micro benchmark within the kernel's use of CRNG is, but it clearly takes a (surprisingly big) hit.

Indeed. Apologies, I didn't mean to waste time pointing at a single
micro-benchmark. My point should have been that these things
can pop up in surprising places and take a surprisingly big hit, which,
because it is low in the software stack, has an outsize effect on the
wider system.
Especially because Crypto code tends to be wrapped up in a library, which developers are (understandably) taught not to try and re-roll themselves.


- A shift/or based rotate is between 48 and 96 bits of instruction data to
  fetch.

it seems hard to get 48 bit, but 64 bit is doable (with an immediate and RVC)

srli a5, a0, (-imm)
C.slli a0, a0, imm
C.or a0, a0, a5
 
As opposed to (assuming no compressed version) 32-bits of
  rotate. The energy efficiency impact on crypto code for having to
  execute three instructions rather than one will not be good. I don't
  have numbers for this, but I'd be keen to see the energy efficiency
  impact of favouring shift/or over rotate. 
 

Sounds like valuable data to obtain.

Agreed. I think it's been missing from a lot of the macro-op fusion based
justifications around RISC-V.
I can't see why ROR/ROL would be disproportionately expensive compared to something like a SLL/SRL.


  • Do you see value in a ROL instruction (which is ROR with -rs2)  if ROR is provided?
Crypto code always does its shift/rotations by constant amounts.
For the most part, that means you'd use the immediate rotate/shift instruction anyway.

The only time I've seen the need for a variable ROR/ROL is when the
rotation distance is dependent on something like a table of round
constants (like the Keccak Round function for SHA3), which get loaded
in from memory at the start of a round. Even then, you can just change
the round constant in memory to achieve the same effect. So, no, I've
not yet seen enough examples for crypto where ROL is essential in
addition to ROR.

  • Do you see value in the CLMUL  and BMATXOR instructions?  Does security impose special implementation considerations for these instructions?
- I certainly see CLMUL being useful for AES-GCM. As long as it is
  a constant time implementation (i.e. has no early-out mechanism) I
  can't think of any obvious security headaches without studying it
  properly. It's use in pseudo-random number generation might also be
  useful in some software based power/EM side-channel countermeasures,
  which require (relatively) large amounts of randomness to work.

- I suspect BMATXOR will be found to be useful for some algorithms or
  primitives, but I haven't found them yet. Looking for them is on the
  todo list. I'm not familiar enough with implementations of this to
  comment on the security implications yet.

  • Do you see bitmanip-like primitives you feel would be very useful that have been missed ?
Several. One of our research projects at the moment is looking at
scalar cryptography on RISC-V, and we were delighted when several very
useful instructions we had identified were also discussed separately in
Bitmanip:

- CMOV is useful for constant time implementations.

- I believe that CMIX will prove useful for certain types of power/EM
  countermeasures, but I haven't got far enough into that yet to be
  sure.

- The funnel shifts are useful for 64-bit rotations on 32-bit platforms,
  which can be a feature of some lightweight block ciphers.

- The pack instructions are very useful for ciphers which operate on
  sub-word elements. SPARX is a good example, since it uses 16-bit
  elements and performs rotations.

- The bitfield scatter/gather instructions look promising for
  implementing very irregular permutation layers.

- The grev/shfl instructions are on my todo list. They look like they
  should be useful for permutation layers, but I need to see how well
  they map.


Cheers,
Ben


Rogier Brussee

unread,
Sep 20, 2019, 7:22:01 AM9/20/19
to RISC-V ISA Dev, rogier....@gmail.com


On Friday, September 20, 2019 at 10:36:10 UTC+2, Ben Marshall wrote:


On 19/09/2019 15:43, Rogier Brussee wrote:


On Thursday, September 19, 2019 at 10:25:58 UTC+2, Ben Marshall wrote:
[snip]

Thanks for your valuable input!
 
As opposed to (assuming no compressed version) 32-bits of
  rotate. The energy efficiency impact on crypto code for having to
  execute three instructions rather than one will not be good. I don't
  have numbers for this, but I'd be keen to see the energy efficiency
  impact of favouring shift/or over rotate. 
 

Sounds like valuable data to obtain.

Agreed. I think it's been missing from a lot of the macro-op fusion based
justifications around RISC-V.


I agree. 

[snip]
 
- A fused shift/or implementation requires an extra working register.

• Do you think a security oriented ROR/ROL imposes special implementation restrictions?

I can't see why ROR/ROL would be disproportionately expensive compared to something like a SLL/SRL.


It should be about as expensive as SLL/SRL, just that SLL/SRL are already required. GREV should be similarly expensive, but not quite so easy to do with SLL/SRL. You made good points for having a ROR instruction, though.
  • Do you see value in a ROL instruction (which is ROR with -rs2)  if ROR is provided?
Crypto code always does its shift/rotations by constant amounts.
For the most part, that means you'd use the immediate rotate/shift instruction anyway.

The only time I've seen the need for a variable ROR/ROL is when the
rotation distance is dependent on something like a table of round
constants (like the Keccak Round function for SHA3), which get loaded
in from memory at the start of a round. Even then, you can just change
the round constant in memory to achieve the same effect. So, no, I've
not yet seen enough examples for crypto where ROL is essential in
addition to ROR.


That confirms my poorly informed impression.
 

[snip] 
- CMOV is useful for constant time implementations.


Would constant timing also be satisfied with an instruction

AND.NE rd rs1 rs2 # rd = (rs1)?rs2: 0 == (-!!rs1)&rs2

and a macro / compiler technique / fusable sequence (if you really want to)

 CMOV rd rs1 rs2 rs3 # rs1 ? rs2 : rs3, assume rd != rs1, rs3
 {
   XOR rd rs2 rs3
   AND.NE rd rs1 rd
   XOR rd rd rs3
 }

perhaps with an additional primitive for the common rd == rs1 case

ELVIS rd rs1 rs2 # rd = (rs1)?rs1: rs2

?
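In C, my reading of ELVIS would be roughly the following branch-free sketch (mine, kept branch-free so the constant-time property is preserved):

#include <stdint.h>

typedef uint64_t uint_xlen_t;   /* XLEN = 64 for this sketch */

static uint_xlen_t elvis(uint_xlen_t rs1, uint_xlen_t rs2)
{
    uint_xlen_t keep = (uint_xlen_t)0 - (uint_xlen_t)(rs1 != 0);  /* all ones iff rs1 != 0 */
    return rs1 | (~keep & rs2);                                   /* rs1 ? rs1 : rs2       */
}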
 
- I believe that CMIX will prove useful for certain types of power/EM
  countermeasures, but I haven't got far enough into that yet to be
  sure.


Is there a reason why using a sequence of simple bit instructions like 

 CMIX rd rs1 rs2 rs3 # rd = rs1&rs2 | (~rs1)&rs3 , assume rd != rs1, rs3
{
  XOR rd rs2 rs3
  AND rd rd rs1
  XOR rd rd rs3
}

is problematic from a power/EM  point of view? 
 
[snip] 
 
 
Thanks!

Rogier

 
Cheers,
Ben


lkcl

unread,
Sep 20, 2019, 8:54:58 AM9/20/19
to RISC-V ISA Dev, rogier....@gmail.com
On Friday, September 20, 2019 at 4:36:10 PM UTC+8, Ben Marshall wrote:

> Especially because Crypto code tends to be wrapped up in a library, which developers are (understandably) taught not to try and re-roll themselves.

Sometimes, the sensible ones listen! Yes, I was on sci.crypt for a while, back around 2003 :)

The appearance of primitives in BitManip that are suitable for cryptography: I have a couple of questions.

Firstly, it seems to be quite important to have them, despite their specialist use: it's not like the linux kernel is not popular or anything, and without a good PRNG a heck of a lot of things break. Would you be happy to write up a suitable section to be added to the spec which explains this?

Secondly, along the lines of RISC principles, it is sounding like a separate crypto Extension *is* the BitManip extension, given the applicability of so many of the opcodes to hashing and symmetric ciphers. Would you concur?

L.

lkcl

unread,
Sep 20, 2019, 9:38:47 AM9/20/19
to RISC-V ISA Dev, rogier....@gmail.com
On Friday, September 20, 2019 at 4:29:02 PM UTC+8, Rogier Brussee wrote:
> On Friday, September 20, 2019 at 06:19:06 UTC+2, lkcl wrote:
> Also I haven't seen any mention of video encode and decode examples. I would anticipate that at the very least, BitManip would be used for efficient YUV RGB conversion, across a range of bit widths on either side (RGB8888, RGB565 etc).
>
>
> If nothing else, that sounds like a great real world testcase to me. You and Jacob seem to have built up considerable expertise in this area.

Well, it is more that, as a small team, we've taken *on* a task that requires considerable expertise, the only saving grace being that it's a well-trodden path.

That just leaves time, sigh. Vulkan and 3D have been the primary initial focus, and Vulkan has YUV to RGB conversion passthrough from GLSL, which is typically used as a conduit for video.

You may have seen demos by NVIDIA of spinning cubes playing videos on each one, and Firefox *only* plays accelerated decoded videos through OpenGL.

Bottom line, we simply can't deal with this without outside help. I can put in a request for funding for up to EUR 50,000 to get some financial backing for it, but with so much else to do...

> Maybe you have a little C (or Rust) program that does the essential step that you can share? If you could also indicate how you think it would efficiently  compile  to the current IMC +  Bitmanip (or conversely why you fail to have an efficient solution), that would be very useful too, I think. I am sure Clifford or others can chime in if things can be done more efficiently. 

That basically means tracking down the relevant parts of ffmpeg and looking at the source; bear in mind that it really is that simple. The compiler can and must be capable of handling BitManip backed up by SV parallelism in the high-end performance case, and could get away without SV (doing manual for-loops) in the low- to mid-end performance cases.

https://www.ffmpeg.org/doxygen/2.6/yuv2rgb_8c_source.html

Where after a lot of futzing around I found this:

https://ffmpeg.org/pipermail/ffmpeg-cvslog/2013-December/072060.html

Which will need some close inspection and conversion, and tracking down similar hotspots.
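to give a flavour of the per-pixel work involved, here is an *illustrative* BT.601-style full-range integer conversion with approximated coefficients - this is emphatically *not* ffmpeg's actual code:

#include <stdint.h>

static uint8_t clamp8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* one pixel of YUV -> RGB888; coefficients are 8.8 fixed-point
   approximations of 1.402, 0.344, 0.714 and 1.772 (assumes arithmetic
   right shift of negative ints, as typical on RISC-V targets) */
static void yuv_to_rgb888(uint8_t y, uint8_t u, uint8_t v,
                          uint8_t *r, uint8_t *g, uint8_t *b)
{
    int c = y, d = (int)u - 128, e = (int)v - 128;
    *r = clamp8(c + ((359 * e) >> 8));
    *g = clamp8(c - ((88 * d + 183 * e) >> 8));
    *b = clamp8(c + ((454 * d) >> 8));
}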

It's a lot of work, and if there is anyone who would like to be funded to do this, there is a deadline coming up Oct 1st; otherwise the next one is Dec 1st, and there is around a 3-month approval latency.

L.




Ben Marshall

unread,
Sep 20, 2019, 9:38:52 AM9/20/19
to Rogier Brussee, RISC-V ISA Dev
Yes, that looks like it would do it. By constant time, I really meant
the avoidance of control flow changes.
I guess there's a bigger discussion to be had around 3-source
instructions and their merit. Personally, I'm in favour, but I don't
have any reasoning to share that hasn't already been part of those
discussions. Maybe only to say that cryptographic operations tend to
involve multiple-inputs and multiple-outputs, and having that
instruction format might make it easier to express some of the
more specialised operations in a dedicated crypto extension.

 
- I believe that CMIX will prove useful for certain types of power/EM
  countermeasures, but I haven't got far enough into that yet to be
  sure.


Is there a reason why using a sequence of simple bit instructions like 

 CMIX rd rs1 rs2 rs3  # rd = rs1&rs2 | (~rs1)&rs3, assume rd != rs1, rs3
{
  XOR rd, rs2, rs3
  AND rd, rd, rs1
  XOR rd, rd, rs3
}

is problematic from a power/EM  point of view?
In general, it is very hard to reason about power/EM properties of
a piece of cryptographic software above the ISA level.
One really needs to consider the micro-architecture of the
core (and indeed SoC) being evaluated. I temper any assertions I go on to make with that caution.

Most generally, more instructions mean more intermediate results,
which are potential causes of information leakage.
The more instructions you have, the more data flow there is to track and
potentially have to protect.
One macro-instruction might also leak information though, so it can go
either way. I'm sorry I don't have a more definite answer on that one.

Some of the experiments we're doing involve trying to answer this
sort of question, but it's early days and I'd be uncomfortable saying
anything more concrete yet. I don't want to cloud the discussion with
things that the scientific literature is wary of.

 
[snip] 
 
 
Thanks!

Rogier

 
Cheers,
Ben


Ben Marshall

unread,
Sep 20, 2019, 9:52:33 AM9/20/19
to lkcl, RISC-V ISA Dev, rogier....@gmail.com


On 20/09/2019 13:54, lkcl wrote:
> On Friday, September 20, 2019 at 4:36:10 PM UTC+8, Ben Marshall wrote:
>
>> Especially because Crypto code tends to be wrapped up in a library, which developers are (understandably) taught not to try and re-roll themselves.
> Sometimes, the sensible ones listen! YesI have been on sci.crypt for a while back around 2003 :)
>
> The appearance of primitives in BitManip that are suitable for cryptography: I have a couple of questions.
>
> Firstly, it seems to be quite important to have them, despite their specialist use: it's not like the linux kernel is not popular or anything, and without a good PRNG a heck of a lot of things break. Would you be happy to write up a suitable section to be added to the spec which explains this?
The Bitmanip spec already mentions the PRNG use case for CLMUL.
I don't know if that sort of explanation of use-cases belongs in the spec
or somewhere else, but I agree it's a useful thing to have.
> Secondly, along the lines of RISC principles, it is sounding like a separate crypto Extension *is* the BitManip extension, given the applicability of so many of the opcodes to hashing and symmetric ciphers. Would you concur?
Almost. I disagree that the bitmanip extension serves entirely as the
crypto extension.
There is clearly a lot of overlap, and my hope is that much of the
overlapping functionality can be shared between the two.
e.g. One could implement just the crypto extension or just bitmanip
and use the same opcode for ROR/CMOV etc.

There are however other things which a crypto extension will need which
do not belong in Bitmanip. Public key cryptography support is the big one,
which boils down to long arithmetic. Current RISC-V support for long
arithmetic suffers because of how it handles carry, and the lack of
indexed load and store.

Having a dedicated crypto extension also means you can add very specialist
instructions for algorithms like AES/SHA*, which clearly belong on
their own special set.

Cheers,
Ben

>
> L.
>

lkcl

unread,
Sep 20, 2019, 10:01:29 AM9/20/19
to RISC-V ISA Dev, rogier....@gmail.com


On Friday, September 20, 2019 at 2:38:52 PM UTC+1, Ben Marshall wrote:
 
is problematic from a power/EM  point of view?
In general, it is very hard to reason about power/EM properties of
a piece of cryptographic software above the ISA level.
One really needs to consider the micro-architecture of the
core (and indeed SoC) being evaluated. I temper any assertions I go on to make with that caution.


rogier: cryptography is predicated on the statistical principle that for a one bit change in the input, *on average* 50% of the bits of the output will change.

thus, in the extreme, this principle has to be applied *literally* right the way through the high-level design, the low-level primitives, the opcodes *and* the f****** hardware!

in addition, as ben is rightly pointing out, you simply cannot afford to have a timing change (a change in the completion time of the algorithm, based on the input).

the depths that you have to go to to fully meet these criteria are.. insane.

take for example a single wire in a chip.  flip the bit, you get some power consumption, right?  no problem, right?  except... to an attacker, *that's information*!  the fact that the transition from 0 to 1 or from 1 to 0 occurred and caused a detectable variation in power consumption provides information based on a statistical analysis of what input bits were flipped.

so to deal with that, what you have to do is: have *two* differential pairs for every one wire in a "normal" chip.  one of those differential pairs is the "real" one, and when the "real" one is not transitioning, you flip the FAKE one instead!

why differential pairs?  because the power-transition from 1 to 0 might not consume the exact same amount of power as a transition from 0 to 1.  a diff-pair will at least use the same power (and also reduce EM).

so now we're looking at a minimum of 4x the power consumption.

caches are also out.

so by "insane" i mean, "it's insane".  every single performance-optimisation ever dreamed of over the past 50 years of computer science has to be eliminated, in the name of commmplete paranoia.

if you are not so concerned about power analysis, but only timing analysis, it is *only* slightly easier.

last year someone came up with the bright but naive idea, bless 'em, that RVV could be hard-coded-designed to provide timing-analysis guarantees.  this would have been absolutely massively detrimental to RVV if carried through.  i do not know if they actually listened to my advice which summarised as "under NO circumstances go down that route: save it for a specialist Crypto-RVV Extension".

l.

lkcl

unread,
Sep 20, 2019, 10:39:30 AM9/20/19
to RISC-V ISA Dev, luke.l...@gmail.com, rogier....@gmail.com
On Friday, September 20, 2019 at 2:52:33 PM UTC+1, Ben Marshall wrote:
 
The Bitmanip spec already mentions the PRNG use case for CLMUL.
I don't know if that sort of explanation of use-cases belongs in the spec
or somewhere else,

yes absolutely it does.  these things are very important, and need to be "collated" in one place, preferably by someone with suitable hands-on expertise (hint, hint :) )
 
but I agree it's a useful thing to have.
> Secondly, along the lines of RISC principles, it is sounding like a separate crypto Extension *is* the BitManip extension, given the applicability of so many of the opcodes to hashing and symmetric ciphers.  Would you concur?
 
Almost. I disagree that the bitmanip extension serves entirely as the
crypto extension.

(see below)
 
There is clearly a lot of overlap, and my hope is that much of the
overlapping functionality can be shared between the two.
e.g. One could implement just the crypto extension or just bitmanip
and use the same opcode for ROR/CMOV etc.

There are however other things which a crypto extension will need which
do not belong in Bitmanip. Public key cryptography support is the big one,
which boils down to long arithmetic.

had a lot of fun with that, back in 2003.  imagine doing big integer math on a massively-parallel 4096-wide array of 2-bit ALUs.... :)
 
Current RISC-V support for long
arithmetic suffers because of how it handles carry, and the lack of
indexed load and store.

Bruce: didn't you mention something about a bitmanip primitive that would hypothetically work as a "carry" substitute?

Also, for the 3D GPU/VPU, we're missing an indexed MV (suitable for s-boxes) of the form regs[rd] = regs[rd + regs[rs1]] and yes, indexed LD/ST as well, all of which we will need to add as scalar ops then drop SV on top to parallelise them.

In addition, I don't know if you have heard of the ASP: it was developed by Argy Krikelis, Ian Jalowiecki et al, one of its applications was for the Star Wars Programme (!!) https://www.researchgate.net/profile/Argy_Krikelis/research

my job back in 2003 was as a FAE for Aspex Semi, specialising in cryptography.  the team had put in a special general-purpose opcode which would "travel" between the 2-bit ALUs in a similar fashion to the RVV "set-before-first mask bit" instruction:


these i *believe* are similar - i apologise this was almost 15 years ago - to the instruction that we used as a "carry propagator".  that carry-propagation was between 2-bit ALUs however it worked perfectly well when the ALUs were opened up to create 4-long ALUs, 8-long ALUs, 16, 32, 64, 128, 256, 4096 and even off-chip to create 65536-long additions in a single cycle.  obviously we had to cope with ripple-carry delay, the longer the chain got, but that was just a timing computation.

although in the ASP we actually did the ADDs as XORs and ANDs (yes, really!) plus this instruction-which-could-be-used-for-carry-propagation, the same principle i believe could be applied here, between 32-bit / 64-bit additions, performing a single-instruction pre-analysis as to whether a carry to the next 32/64-bit in the chain will be needed.  it's... marginally wasteful, but a hell of a lot better than futzing about with comparators to check if overflow might occur.

so can i ask you: if an instruction *did* exist which allowed you to *detect* if carry were needed, what would it actually look like?  would any of those RVV "Mask" operations do the trick?
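
for context, the flag-free pattern that such an instruction would be competing against is the usual ADD-plus-SLTU carry detect. a minimal C sketch, in the uint_xlen_t convention from the review (the helper name is purely illustrative):

/* one limb of a multi-word add: SLTU provides the carry-out,
   since RV has no flags register */
uint_xlen_t add_limb(uint_xlen_t a, uint_xlen_t b,
                     uint_xlen_t carry_in, uint_xlen_t *carry_out)
{
    uint_xlen_t sum = a + b;           /* ADD  */
    uint_xlen_t c1  = sum < a;         /* SLTU */
    sum += carry_in;                   /* ADD  */
    uint_xlen_t c2  = sum < carry_in;  /* SLTU */
    *carry_out = c1 | c2;              /* OR: c1 and c2 can never both be set */
    return sum;
}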


Having a dedicated crypto extension also means you can add very specialist
instructions for algorithms like AES/SHA*, which clearly belong on
their own special set.

if going down the route of hard-macros for certain functions, with a single instruction activating them, an ISA is not a good fit: the amount of time it takes would result in huge complications (see the memset thread from last week).

a DMA interface on the other hand would be perfect, and that's not within the scope of an ISA: it's a platform architectural decision.

about the big integer math: i *believe* some people have been considering RVV as a suitable platform for that.  unfortunately, the RVV spec is sparse on examples in this regard, including here:

l.

Jacob Lifshay

unread,
Sep 20, 2019, 4:34:30 PM9/20/19
to Ben Marshall, Rogier Brussee, RISC-V ISA Dev
On Fri, Sep 20, 2019, 01:36 Ben Marshall <bm1...@bristol.ac.uk> wrote:
  • Do you see value in a ROL instruction (which is ROR with -rs2)  if ROR is provided?
Crypto code always does its shift/rotations by constant amounts.
For the most part, that means you'd use the immediate rotate/shift instruction anyway.

The only time I've seen the need for a variable ROR/ROL is when the
rotation distance is dependent on something like a table of round
constants (like the Keccak Round function for SHA3), which get loaded
in from memory at the start of a round. Even then, you can just change
the round constant in memory to achieve the same effect. So, no, I've
not yet seen enough examples for crypto where ROL is essential in
addition to ROR.

Please don't leave out either ROL or ROR: there are common algorithms that require variable rotates; one of them is the PCG class of random number generators that I mentioned earlier, which is a much better random number generator due to being smaller, faster, and producing better output than even the Mersenne Twister. I would not be surprised if there are more examples that use variable rotates for random numbers, hashes, and other things that I'm not aware of.
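
For concreteness, the output permutation of the 32-bit PCG-XSH-RR variant looks roughly like this (a sketch based on the pcg-random.org minimal C code, written from memory, so treat the exact constants and the function name as illustrative):

#include <stdint.h>

/* PCG32 (XSH-RR) output step: the rotate amount comes from the top
   bits of the 64-bit state, so this is a *variable* rotate right */
uint32_t pcg32_output(uint64_t state)
{
    uint32_t xorshifted = (uint32_t)(((state >> 18u) ^ state) >> 27u);
    uint32_t rot = (uint32_t)(state >> 59u);
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31u));
}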

Jacob

Allen Baum

unread,
Sep 20, 2019, 6:46:49 PM9/20/19
to Jacob Lifshay, Ben Marshall, Rogier Brussee, RISC-V ISA Dev
Quickly:
There is a separate crypto extension that has a lot of work put into it. It may share some opcodes with bitmanip (which means those ops are present if either extension is implemented). Or not.

Secondly: if you want to avoid power side channels, don’t bother trying to craft specially implemented opcodes; just add a crypto coprocessor that is specially designed to do that (Cryptography Research, a division of Rambus, has demonstrated this at several events and licenses the cores).


-Allen

>

Clifford Wolf

unread,
Sep 21, 2019, 8:14:58 AM9/21/19
to Rogier Brussee, RISC-V ISA Dev
Hi,

On Wed, Sep 18, 2019 at 3:43 PM Rogier Brussee <rogier....@gmail.com> wrote:
On Monday, September 16, 2019 at 09:46:51 UTC+2, clifford wrote:
On Mon, Sep 9, 2019 at 8:13 PM Rogier Brussee <rogier...@gmail.com> wrote:
[..] For signed shorts and chars you can invoke undefined behaviour to not care about overflow (and do you want to?). However, you need sign/zero extension every time you pass a signed/unsigned short/char per the calling convention.

That's not entirely true. Because, as you have stated correctly, overflow for signed data types is undefined in C, it is safe for a compiler to not sign-extend in most cases involving signed data types. There's no need to adhere to the calling convention once undefined behavior is triggered.

I was just going to uncomfortably agree with you when I realised that I was wrong about the overflow. 
[..]
There is no undefined behaviour for adding shorts, because arithmetic on shorts first does a conversion to int and then adds the ints; the narrowing conversion back happens only at the assignment, i.e.

extern short  a,b;
short c = a + b;

is equivalent to 
short c = (short)((int)a + (int)b);

Oops. Now that you mention it I remember this.. You are of course correct.

There is still a signed overflow when converting the int result back to a short. In C99 and C11 we find the following language (6.3.1.3):

Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.

But because this is implementation defined and not undefined it would indeed not be okay to violate the calling convention and the sign extension must be fixed with a slli+srai sequence.

That being said, I still don't think it is an urgent matter. A while back I disassembled the entire risc-v debian base image (debian-riscv64-tarball-20180418) and checked the (static) frequency of byte and half-word sign extension and zero extension sequences:

total 10489586
sextb 2198 (0.0210%)
sexth 595 (0.0057%)
zextb 150 (0.0014%)
zexth 4068 (0.0388%)

("total" is the total number of instructions in all binaries and shared objects included in that base image)

You can find the script I used for this analysis here: http://svn.clifford.at/handicraft/2019/rvlogicops

regards,
 - Clifford

Rogier Brussee

unread,
Sep 21, 2019, 12:33:04 PM9/21/19
to RISC-V ISA Dev, rogier....@gmail.com


On Saturday, September 21, 2019 at 14:14:58 UTC+2, clifford wrote:
Hi,

Hi Clifford,

[snip] 
[..]
There is no undefined behaviour for the adding shorts because arithmetic on shorts first does a conversion to an int and then adds the ints, and only at assignments do you do  i.e.

extern short  a,b;
short c = a + b;

is equivalent to 
short c = (short)((int)a + (int)b);

Oops. Now that you mention it I remember this.. You are of course correct.

There is still a signed overflow when converting the int result back to a short. In C99 and C11 we find the following language (6.3.1.3):

Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.

But because this is implementation defined and not undefined it would indeed not be okay to violate the calling convention and the sign extension must be fixed with a slli+srai sequence.


Right.
 
That being said, I still don't think it is an urgent matter. A while back I disassembled the entire risc-v debian base image (debian-riscv64-tarball-20180418) and checked the (static) frequency of byte and half-word sign extension and zero extension sequences:

total 10489586
sextb 2198 (0.0210%)
sexth 595 (0.0057%)
zextb 150 (0.0014%)
zexth 4068 (0.0388%)

("total" is the total number of instructions in all binaries and shared objects included in that base image)



That is very important data.  However, it was not the Debian distribution I was worried about, but deeply embedded software that has been passed on for generations. In the post/proposal by John Hauser he mentions that shorts and (signed) bytes are used extensively in embedded, and that suboptimal handling of bytes and halves was a barrier to adoption. I only raised the issue, but this is something to check with people in embedded. I see supporting sign/zext.[hb] as an example of "support the C language efficiently and without nasty surprises, including software written suboptimally and not for RV", and I assumed that in this case it is really not that expensive to do so.

Ciao and thanks,

Rogier

Clifford Wolf

unread,
Sep 22, 2019, 6:55:44 AM9/22/19
to Rogier Brussee, RISC-V ISA Dev
Moving the sext.h / sext.b discussion to https://github.com/riscv/riscv-bitmanip/issues/46.


lkcl

unread,
Sep 22, 2019, 8:50:59 AM9/22/19
to RISC-V ISA Dev, rogier....@gmail.com
On Sunday, September 22, 2019 at 11:55:44 AM UTC+1, clifford wrote:
Moving the sext.h / sext.b discussion to https://github.com/riscv/riscv-bitmanip/issues/46.

clifford, respectfully, again, may i gently remind you that there are people here (more than one) who do not have github accounts?

with english not being your first language, this may just be a language issue: "moving the discussion" would imply "please cease and desist further discussion here on this list", whereas saying "i have raised a special ticket to keep track of sext.h / sext.b, please feel free to discuss here or at that location" would indicate that both are welcome.

l.

lkcl

unread,
Sep 22, 2019, 9:49:27 AM9/22/19
to RISC-V ISA Dev, program...@gmail.com, bm1...@bristol.ac.uk, rogier....@gmail.com


On Friday, September 20, 2019 at 11:46:49 PM UTC+1, Allen Baum wrote:
Quickly:
 There is a separate crypto extension that has a lot of work out into it. It may share some opcodes with bitmanip ( which means those ops are present if either extension is implemented). Or not.

sigh.  and, here, as "outsiders", we have no constructive input or possibility of feedback, either way.  this has to change.


Secondly: if you want to avoid power side channels, don’t bother trying to craft specially implemented opcodes; just add a crypto coprocessor that is specially designed to do that (Crrypography Research,  a division of Rambus, has demonstrated this at several events and licenses the cores.

yehyeh, it's the only way to be absolutely sure.  all other areas of computer science i'd be more than happy to develop some form of actual-RISC-V-instruction-extension.  cryptography is the one area you *do not* mess about.  the requirements are so diametrically opposed to the design goals of a general-purpose processor that it's not even funny.

l.

Rogier Brussee

unread,
Sep 22, 2019, 3:37:01 PM9/22/19
to RISC-V ISA Dev, bm1...@bristol.ac.uk, rogier....@gmail.com


On Friday, September 20, 2019 at 22:34:30 UTC+2, Jacob Lifshay wrote:
The question here is whether, assuming you have ROR, you also need ROL, which would be just 

ROL rd rs1 rs2 ;t # assume t !=  rs1
{
      NEG t, rs2
      ROR rd, rs1, t
}
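
In C terms, using the bit-by-bit conventions from the review (shift amounts reduced mod XLEN, as RV does), a sketch:

/* rotate right: both shift amounts stay in range even when rs2 % XLEN == 0 */
uint_xlen_t ror(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return (rs1 >> (rs2 % XLEN)) | (rs1 << ((-rs2) % XLEN));
}

/* rotate left is just rotate right by the negated amount */
uint_xlen_t rol(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return ror(rs1, -rs2);
}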

Rogier
 
Jacob

lkcl

unread,
Sep 22, 2019, 9:18:45 PM9/22/19
to RISC-V ISA Dev, bm1...@bristol.ac.uk, rogier....@gmail.com
On Monday, September 23, 2019 at 3:37:01 AM UTC+8, Rogier Brussee wrote:


> The question here is whether assuming you have ROR, do you also need ROL which would be just 
>
>
> ROL rd rs1 rs2 ;t # assume t !=  rs1
> {
>       NEG t, rs2
>       ROR rd, rs1, t
> }

Hi Rogier, Jacob's travelling at the moment so does not have easy computer access.

I looked up http://pcg-random.org and found the algorithm on wikipedia:
https://en.m.wikipedia.org/wiki/Permuted_congruential_generator

I am guessing here that what Jacob is pointing out is that the inner loop on PCG is so small that removing one or other of those ops and replacing it with a 2-op sequence would have a disproportionately large impact on performance.

With PRNGs often also being on critical paths that require large PRNG output blocks in the shortest possible timespan, that may actually matter.

There is an academic paper on PCG at the link above.

L.

Rogier Brussee

unread,
Sep 23, 2019, 3:59:30 AM9/23/19
to RISC-V ISA Dev, bm1...@bristol.ac.uk, rogier....@gmail.com


On Monday, September 23, 2019 at 03:18:45 UTC+2, lkcl wrote:
On Monday, September 23, 2019 at 3:37:01 AM UTC+8, Rogier Brussee wrote:


> The question here is whether assuming you have ROR, do you also need ROL which would be just 
>
>
> ROL rd rs1 rs2 ;t # assume t !=  rs1
> {
>       NEG t, rs2
>       ROR rd, rs1, t
> }

Hi Rogier, Jacob's travelling at the moment so does not have easy computer access.

I looked up http://pcg-random.org and found the algorithm on wikipedia:
https://en.m.wikipedia.org/wiki/Permuted_congruential_generator

I am guessing here that what Jacob is pointing out is that the inner loop on PCG is so small that removing one or other of those ops and replacing it with a 2-op sequence would have a disproportionately large impact on performance.


In the link you sent, only ROR is used, which is sort of the point. 


If people think a ROR with a negative rs2 is worth it, they should consider whether

SLLNEG rd rs1 rs2; t #assume  t != rs1
{
    NEG t rs2
    SLL rd rs1 t
}

and similarly SRANEG and SRLNEG are worth being instructions.

Ciao
Rogier

lkcl

unread,
Sep 23, 2019, 4:47:12 AM9/23/19
to RISC-V ISA Dev, bm1...@bristol.ac.uk, rogier....@gmail.com


On Monday, September 23, 2019 at 8:59:30 AM UTC+1, Rogier Brussee wrote:
In the link you send, only ROR is used, which is sort of the point. 

yes i spotted that as well.  honestly can't say, rogier, we'll have to wait for Jacob to hit a keyboard :)

l.
 

Rogier Brussee

unread,
Sep 23, 2019, 4:49:45 AM9/23/19
to RISC-V ISA Dev, rogier....@gmail.com


On Sunday, September 22, 2019 at 12:55:44 UTC+2, clifford wrote:
Moving the sext.h / sext.b discussion to https://github.com/riscv/riscv-bitmanip/issues/46.


Thanks!

Rogier
 


Bruce Hoult

unread,
Sep 23, 2019, 12:44:22 PM9/23/19
to Rogier Brussee, RISC-V ISA Dev, bm1...@bristol.ac.uk
On Mon, Sep 23, 2019 at 12:59 AM Rogier Brussee
<rogier....@gmail.com> wrote:
>
> On Monday, September 23, 2019 at 03:18:45 UTC+2, lkcl wrote:
>>
>> On Monday, September 23, 2019 at 3:37:01 AM UTC+8, Rogier Brussee wrote:
>>
>>
>> > The question here is whether assuming you have ROR, do you also need ROL which would be just
>> >
>> >
>> > ROL rd rs1 rs2 ;t # assume t != rs1
>> > {
>> > NEG t, rs2
>> > ROR rd, rs1, t
>> > }
>>
>> Hi Rogier, Jacob's travelling at the moment so does not have easy computer access.
>>
>> I looked up http://pcg-random.org and found the algorithm on wikipedia:
>> https://en.m.wikipedia.org/wiki/Permuted_congruential_generator
>>
>> I am guessing here that what Jacob is pointing out is that the inner loop on PCG is so small that removing one or other of those ops and replacing it with a 2-op sequence would have a disproportionately large impact on performance.
>>
>
> In the link you send, only ROR is used, which is sort of the point.

Yes, I saw that.

But also, by definition the rotate amount is random, so it doesn't
actually matter whether it's left or right.

lkcl

unread,
Sep 24, 2019, 12:00:01 PM9/24/19
to RISC-V ISA Dev, rogier....@gmail.com, bm1...@bristol.ac.uk
https://www.meetup.com/Bay-Area-RISC-V-Meetup/events/264817954/

briefly, if anyone's interested in cryptography there's a bay area RISC-V meetup on the topic.

l.

Jacob Lifshay

unread,
Sep 25, 2019, 10:06:40 PM9/25/19
to Bruce Hoult, Rogier Brussee, RISC-V ISA Dev, Ben Marshall
On Mon, Sep 23, 2019 at 10:44 AM Bruce Hoult <bruce...@sifive.com> wrote:
>
> On Mon, Sep 23, 2019 at 12:59 AM Rogier Brussee
> <rogier....@gmail.com> wrote:
> >
> > On Monday, September 23, 2019 at 03:18:45 UTC+2, lkcl wrote:
> >>
> >> On Monday, September 23, 2019 at 3:37:01 AM UTC+8, Rogier Brussee wrote:
> >>
> >>
> >> > The question here is whether assuming you have ROR, do you also need ROL which would be just
> >> >
> >> >
> >> > ROL rd rs1 rs2 ;t # assume t != rs1
> >> > {
> >> > NEG t, rs2
> >> > ROR rd, rs1, t
> >> > }
> >>
> >> Hi Rogier, Jacob's travelling at the moment so does not have easy computer access.

Yeah, sorry about the delay.

> >>
> >> I looked up http://pcg-random.org and found the algorithm on wikipedia:
> >> https://en.m.wikipedia.org/wiki/Permuted_congruential_generator
> >>
> >> I am guessing here that what Jacob is pointing out is that the inner loop on PCG is so small that removing one or other of those ops and replacing it with a 2-op sequence would have a disproportionately large impact on performance.

Yes, that's the case.

> >>
> >
> > In the link you send, only ROR is used, which is sort of the point.

Yes, I concede that only one or the other is needed. Still, if there's
readily available space, I think both left and right rotate should be
supported, both for symmetry and because there are likely to be
additional algorithms, now and in the future, that have similar
requirements for variable rotates but with different directions. The
additional hardware to invert the rotation direction should be quite
small (a 5-bit negate for RV32), so not much of an issue.

If we do decide to have only one rotate direction, I think we should
pick the direction that matches PCG with a single instruction, since PCG
is intended to generate random numbers very quickly, rotate
instructions would have a disproportionately large impact due to the
short inner loop, and PCG is becoming commonly used.

>
> Yes, I saw that.
>
> But also, by definition the rotate amount is random, so it doesn't
> actually matter whether it's left or right.

It does matter, because people often pick a particular algorithm so
they can obtain the same random sequence cross-platform (eg.
synchronizing random events in a networked game).

Jacob Lifshay

Rogier Brussee

unread,
Sep 26, 2019, 5:34:11 PM9/26/19
to RISC-V ISA Dev, bruce...@sifive.com, rogier....@gmail.com, bm1...@bristol.ac.uk


On Thursday, September 26, 2019 at 04:06:40 UTC+2, Jacob Lifshay wrote:
On Mon, Sep 23, 2019 at 10:44 AM Bruce Hoult <bruce...@sifive.com> wrote:
>
> On Mon, Sep 23, 2019 at 12:59 AM Rogier Brussee
> <rogier...@gmail.com> wrote:
> >
> > On Monday, September 23, 2019 at 03:18:45 UTC+2, lkcl wrote:
> >>
> >> On Monday, September 23, 2019 at 3:37:01 AM UTC+8, Rogier Brussee wrote:
> >>
> >>
> >> > The question here is whether assuming you have ROR, do you also need ROL which would be just
> >> >
> >> >
> >> > ROL rd rs1 rs2 ;t # assume t !=  rs1
> >> > {
> >> >       NEG t, rs2
> >> >       ROR rd, rs1, t
> >> > }
> >>
> >> Hi Rogier, Jacob's travelling at the moment so does not have easy computer access.

Yeah, sorry about the delay.  

People have lives :-)
 
> >>
> >> I looked up http://pcg-random.org and found the algorithm on wikipedia:
> >> https://en.m.wikipedia.org/wiki/Permuted_congruential_generator
> >>
> >> I am guessing here that what Jacob is pointing out is that the inner loop on PCG is so small that removing one or other of those ops and replacing it with a 2-op sequence would have a disproportionately large impact on performance.

Yes, that's the case.

> >>
> >
> > In the link you send, only ROR is used, which is sort of the point.

Yes, I concede that only one or the other is needed, though if there's
readily available space, I think both left and right rotate should be
supported for both symmetry as well as there are likely to be
additional algorithms both now and in the future that have similar
requirements for variable rotates, but with different directions. The
additional hardware to invert the rotation direction should be quite
small (5-bit negate for RV32), so not much of an issue.

From a consistency point of view, I would say: either have a NEG version 
for all shift-like operations (SLL <-> SLLNEG, SLA <-> SLANEG, 
SRA <-> SRANEG, ROR <-> ROL), or none at all. It is not a terribly big deal 
though.


If we do decide to have only one rotate direction, I think we should
pick the direction that is a single instruction to match PCG since it
is intended to generate random numbers very quickly, rotate
instructions would have a disproportionately large impact due to the
short inner loop, and PCG is becoming commonly used.


The Wikipedia article comes with a ROR (as opposed to ROL) version. 
I have not checked, but I can hardly imagine the algorithm working with 
ROR but not with ROL (or vice versa), perhaps after some modifications to the constants.


>
> Yes, I saw that.
>
> But also, by definition the rotate amount is random, so it doesn't
> actually matter whether it's left or right.

It does matter, because people often pick a particular algorithm so
they can obtain the same random sequence cross-platform (eg.
synchronizing random events in a networked game). 

That is a point.

Thanks!

Rogier.

Jacob Lifshay

Jacob Lifshay

unread,
Sep 29, 2019, 2:33:17 AM9/29/19
to Rogier Brussee, RISC-V ISA Dev, Bruce Hoult, Ben Marshall
On Thu, Sep 26, 2019 at 2:34 PM Rogier Brussee <rogier....@gmail.com> wrote:
>
> On Thursday, September 26, 2019 at 04:06:40 UTC+2, Jacob Lifshay wrote:
>>
>> Yes, I concede that only one or the other is needed, though if there's
>> readily available space, I think both left and right rotate should be
>> supported for both symmetry as well as there are likely to be
>> additional algorithms both now and in the future that have similar
>> requirements for variable rotates, but with different directions. The
>> additional hardware to invert the rotation direction should be quite
>> small (5-bit negate for RV32), so not much of an issue.
>
>
> From a consistency point of view, I would say: either have a NEG version
> for all shift like operations (SLL<-> SLLNEG ,SLA <-> SLANEG,
> SRA <->SRANEG, ROR<->ROL), or none at all. It is not a terribly big deal
> though.

(assuming you meant SRL[NEG] not SLA[NEG])

I was looking at it from the viewpoint of wanting left-right symmetry
rather than neg symmetry, since otherwise rotate would be the only
instruction that doesn't have left-right symmetry; neg just happens to
be a way to implement left/right rotate in terms of right/left rotate.

Left-right symmetry:
SLL <-> SRL
SLL <-> SRA (would actually be SLA <-> SRA, but SLA is redundant)
ROL <-> ROR


>> If we do decide to have only one rotate direction, I think we should
>> pick the direction that is a single instruction to match PCG since it
>> is intended to generate random numbers very quickly, rotate
>> instructions would have a disproportionately large impact due to the
>> short inner loop, and PCG is becoming commonly used.
>>
>
> The Wikipedia article comes with a ROR (as opposed to ROL) version.
> I have not checked, but I can hardly imagine the algorithm working with
> ROR and not ROL or vice versa perhaps with some modifications on constants.

It produces a different output but otherwise works just fine, assuming
the rotate is the last operation (since a bit-permutation of the
output of a random bit generator should also be random -- pretty sure
BigCrush checks for that kind of thing).
It may be possible to modify the constants to produce the same output
sequence, but that would depend on the particular variant of PCG.

Jacob Lifshay

Rogier Brussee

unread,
Sep 29, 2019, 7:02:08 AM9/29/19
to RISC-V ISA Dev, rogier....@gmail.com, bruce...@sifive.com, bm1...@bristol.ac.uk


On Sunday, September 29, 2019 at 08:33:17 UTC+2, Jacob Lifshay wrote:
On Thu, Sep 26, 2019 at 2:34 PM Rogier Brussee <rogier...@gmail.com> wrote:
>
> On Thursday, September 26, 2019 at 04:06:40 UTC+2, Jacob Lifshay wrote:
>>
>> Yes, I concede that only one or the other is needed, though if there's
>> readily available space, I think both left and right rotate should be
>> supported for both symmetry as well as there are likely to be
>> additional algorithms both now and in the future that have similar
>> requirements for variable rotates, but with different directions. The
>> additional hardware to invert the rotation direction should be quite
>> small (5-bit negate for RV32), so not much of an issue.
>
>
> From a consistency point of view, I would say: either have a NEG version
> for all shift like operations (SLL<-> SLLNEG ,SLA <-> SLANEG,
> SRA <->SRANEG, ROR<->ROL), or none at all. It is not a terribly big deal
> though.

(assuming you meant SRL[NEG] not SLA[NEG])


Sorry yes,
 
I was looking at it from the viewpoint of wanting left-right symmetry
rather than neg symmetry, since otherwise rotate would be the only
instruction that doesn't have left-right symmetry, neg just happens to
be a way to implement left/right rotate in terms of right/left rotate.

Left-right symmetry:
SLL <-> SRL
SLL <-> SRA (would actually be SLA <-> SRA, but SLA is redundant)
ROL <-> ROR



The left-right symmetry is only superficially the right analogue, because
the SLL and SRL/SRA instructions cannot be reduced to each other (using NEG or NOT or ...).


I apologise in advance if the below is obvious to you and I am overly verbose, but it seems 
I am not getting my point across. 


Let me explain by comparing the immediate versions:

Q: Why does nobody propose a ROLI rd rs1 imm instruction?

A: because it is functionally equivalent to

RORI rd rs1 (XLEN - imm)U

which because of modular arithmetic mod XLEN is equivalent to 

RORI rd rs1 (-imm)U

Since for immediates we can  just let the compiler compute that (-imm) (or XLEN - imm), 
a ROLI instruction is redundant.

The register versions ROL and ROR are not redundant:

ROR rd rs1 rs2

and 

ROL  rd rs1 rs2 

are different instructions. However, if you have one, the other just replaces 
a two instruction sequence. E.g. if we have ROR then the macro/compiler generated sequence

ROL_MACRO rd rs1 rs2; t  # assume t != rs1
{
    NEG t, rs2
    ROR rd, rs1, t
}

could be used instead of a ROL instruction. 

Now consider a hypothetical instruction

SLLNEGI rd rs1 imm

with the semantics of 
 
SLLI rd rs1 (XLEN - imm)U

(and similarly for  SRLNEGI <-> SRLI and SRANEGI <-> SRAI). 
Again because of modular arithmetic mod XLEN, which applies because in RV  
shifts  have an rs2/immediate that is subject to modular arithmetic mod XLEN 
(i.e. rs2 &= (XLEN-1) ), this is equivalent to 

SLLI rd rs1 (-imm)U.

Thus, a SLLNEGI instruction would be completely redundant just like ROLI is. It is not a useless concept, however: 
expressions like this crop up all the time in sign extension, extraction of bitfields, the sequence needed
to emulate funnel shifts, or even rotations if you don't want ROR[I] as a separate instruction.
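
As an illustration of where these (XLEN - imm) shift amounts show up, here is a bitfield extraction in the bit-by-bit C style used earlier (a sketch; pos and len are illustrative parameters, with 1 <= len and pos + len <= XLEN assumed):

/* extract the len-bit field of rs1 starting at bit pos, zero-extended:
   SLLI rd, rs1, XLEN - pos - len  then  SRLI rd, rd, XLEN - len */
uint_xlen_t bfext(uint_xlen_t rs1, unsigned pos, unsigned len)
{
    return (rs1 << ((XLEN - pos - len) % XLEN)) >> ((XLEN - len) % XLEN);
}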

The corresponding register versions

SLL rd rs1 rs2

and 

SLLNEG rd rs1 rs2

are not equivalent, but just like ROL and ROR, are different instructions.  However, just as  in the case of ROL they only replace
a two instruction macro/compiler generated sequence

SLLNEG_MACRO rd rs1 rs2; t # assume t != rs1
{
     NEG t, rs2
     SLL rd, rs1, t
}

Now while a NEG version would not seem to give a huge benefit, a negation (minus) on rs2 is already there for SUB. Hence, having ROL as well as ROR is probably cheap. But then it should be equally cheap, perhaps even cheaper, to also have a NEG version of the other shifts, i.e. SLLNEG, SRLNEG and SRANEG. The cost will be dominated by decoding, so it largely depends on encoding details, which may change in case SLO and friends get dropped. Shifts are more prevalent than rotates, so if ROL seems worth it, that suggests so are SLLNEG, SRLNEG and SRANEG, if only to make the register version of the patterns for bit field extraction shorter. But frankly, I doubt a case can be made for any of them, because the register versions of shifts/rotates are so rare. 

It seems x86 has a bewildering array of RORs and ROLs, but ARM has just ROR. Thus, it seems likely that primitives like the PCG random number generator, which would work with either ROL or ROR, would gravitate towards picking ROR. YMMV.

Ciao,
Rogier

Jacob Lifshay

unread,
Sep 29, 2019, 1:25:59 PM9/29/19
to Rogier Brussee, RISC-V ISA Dev, Bruce Hoult, bm1...@bristol.ac.uk
I don't mind; verbosity is better than being too terse. I was under the impression that RISC-V would be the only widely used ISA that didn't have both ROL and ROR instructions.


Let me explain by comparing the immediate versions:

Q; Why does nobody propose a ROLI rd rs1 imm instruction ?

A: because it is functionally equivalent to

RORI rd rs1 (XLEN - imm)U

which because of modular arithmetic mod XLEN is equivalent to 

RORI rd rs  (-imm)U

Since for immediates we can  just let the compiler compute that (-imm) (or XLEN - imm), 
a ROLI instruction is redundant.

This is very similar to why there isn't a SLA instruction: because it's redundant with SLL (interestingly, x86 has an encoding for both of them but most disassemblers pick one or the other mnemonic for both encodings)


The register versions ROL and ROR are not redundant:

ROR rd rs1 rs2

and 

ROL  rd rs1 rs2 

are different instructions. However, if you have one, the other just replaces 
a two instruction sequence. E.g. if we have ROR then the macro/compiler generated sequence

ROL_MACRO rd rs1 rs2; t  # assume t != rs1
{
    NEG t, rs2
    ROR rd, rs1, t
}

could be used instead of a ROL instruction. 

Yup, got all that.

Now consider a hypothetical instruction SLLNEGI rd rs1 rs2 with semantics

SLLNEGI rd rs1 imm

with the semantics of 
 
SLLI rd rs1 (XLEN - imm)U

(and similarly for  SRLNEGI <-> SRLI and SRANEGI <-> SRAI). 
Again because of modular arithmetic mod XLEN, which applies because in RV  
shifts  have an rs2/immediate that is subject to modular arithmetic mod XLEN 
(i.e. rs2 &= (XLEN-1) ), this is equivalent to 

SLLI rd rs1 (-imm)U.

Thus, a SLLNEGI instruction would be completely redundant just like ROLI is. It is not a useless concept, however: 
expressions like this crop up all the time in sign extension, extraction of bitfields, the sequence needed
to emulate funnel shifts, or even rotations if you don't want ROR[I] as a separate instruction.

The corresponding register versions

SLL rd rs1 rs2

and 

SLLNEG rd rs1 rs2

are not equivalent, but just like ROL and ROR, are different instructions.  However, just as  in the case of ROL they only replace
a two instruction macro/compiler generated sequence

SLLNEG_MACRO rd rs1 rs2; t # assume t != rs1
{
     NEG t, rs2
     SLL rd, rs1, t
}

Now while a NEG version would not seem to give a huge benefit, a negation (minus) on  rs2 is already there for SUB. Hence, having ROL as well as ROR is probably cheap.

Not really: the minus is generally integrated into the add ALU, implemented as rs1 + ~rs2 + 1, and only the bitwise not on rs2 is separable. A separate neg ALU would usually be required, or the low bits of the add/sub ALU could be used. That said, I was assuming that most high-performance implementations would have a barrel shifter that can do both left and right rotate, maybe by doing an optional bit-reverse at the beginning and end, which would also eliminate the need for SLL since SRL is the bit-reversed version.

But then it should be equally cheap, perhaps even cheaper,  to also have a NEG version of other shifts, i.e. SLLNEG, SRLNEG and SRANEG.  The cost will be dominated by decoding so largely depends on encoding details which may change in case SLO and friends get dropped.  Shifts are more prevalent than rotates, so if ROL seems worth it, that suggests so are SLLNEG, SRLNEG and SRANEG, if only to make the register version of the patterns for bit field extraction shorter. But frankly, I doubt a case can be made for any of them because the register versions of shifts/rotates are are so rare. 

It seems x86 has a bewildering array of ROR's and ROL's, but ARM has just a ROR. Thus, it seems likely that primitives like the PCG random number generator which would work with both ROL and ROR, would gravitate towards picking ROR.   YMMV.

I was not aware that other architectures had only one rotate direction. Spending a little time with rust.godbolt.org:

x86 has both rol and ror (as well as rcl and rcr, but those are very rarely used)
arm/aarch64 has ror.
mips has ror.
s390 has rol.
powerpc has rol.
webassembly has both rol and ror.
sparc has neither.

so, I think we should have a ror instruction.

Jacob

Rogier Brussee

unread,
Sep 30, 2019, 3:20:26 AM9/30/19
to RISC-V ISA Dev, rogier....@gmail.com, bruce...@sifive.com, bm1...@bristol.ac.uk

[snip]

Not really, the minus is generally integrated into the add alu, implemented as rs1 + ~rs2 + 1, only bitwise not on rs2 is separable. A separate neg alu would usually be required or the low bits of the add/sub alu could be used, though I was assuming that most high-performance implementations would have a barrel shifter that can do both left and right rotate, maybe by doing an optional bit-reverse at the beginning and end, which would also eliminate the need for sll since srl is the bit-reversed version.

Ah yes, I guess I knew that but didn't think of it :-(. That makes _all_ the NEG versions (including ROL) even less attractive. 

 
It seems x86 has a bewildering array of ROR's and ROL's, but ARM has just a ROR. Thus, it seems likely that primitives like the PCG random number generator which would work with both ROL and ROR, would gravitate towards picking ROR.   YMMV.

I was not aware that other architectures had only 1 rotate direction. spending a little time with rust.godbolt.org:

x86 has both rol and ror (as well as rcl and rcr, but those are very rarely used)
arm/aarch64 has ror.
mips has ror.
s390 has rol.
powerpc has rol.
webassembly has both rol and ror.
sparc has neither.

so, I think we should have a ror instruction.


I agree. 
Rogier
 
Jacob

lkcl

unread,
Sep 30, 2019, 5:01:50 AM9/30/19
to RISC-V ISA Dev, rogier....@gmail.com, bruce...@sifive.com, bm1...@bristol.ac.uk
SV needs a full set of direct-equivalent bitmanip opcodes for full peer functionality.  

i did a cross-reference check of the opcodes listed in RVV section 16.1 "Vector Mask-Register Logical instructions", and there are eight in total:

    vmand.mm vd, vs2, vs1     # vd[i] =   vs2[i].LSB &&  vs1[i].LSB
    vmnand.mm vd, vs2, vs1    # vd[i] = !(vs2[i].LSB &&  vs1[i].LSB)
    vmandnot.mm vd, vs2, vs1  # vd[i] =   vs2[i].LSB && !vs1[i].LSB
    vmxor.mm  vd, vs2, vs1    # vd[i] =   vs2[i].LSB ^^  vs1[i].LSB
    vmor.mm  vd, vs2, vs1     # vd[i] =   vs2[i].LSB ||  vs1[i].LSB
    vmnor.mm  vd, vs2, vs1    # vd[i] = !(vs2[i].LSB ||  vs1[i].LSB)
    vmornot.mm  vd, vs2, vs1  # vd[i] =   vs2[i].LSB || !vs1[i].LSB
    vmxnor.mm vd, vs2, vs1    # vd[i] = !(vs2[i].LSB ^^  vs1[i].LSB)

scalar RISC-V has three of these:

    AND rd, rs1, rs2          # rd = rs1 & rs2
    OR  rd, rs1, rs2          # rd = rs1 | rs2
    XOR rd, rs1, rs2          # rd = rs1 ^ rs2

and the ones in BitManip-Draft 0.91 that correspond to equivalents in RVV are:

    ANDN rd, rs1, rs2         # rd = rs1 & ~rs2
    ORN  rd, rs1, rs2         # rd = rs1 | ~rs2
    XORN rd, rs1, rs2         # rd = rs1 ^ ~rs2

that leaves:

    NOR
    NAND

as missing, because they're listed as pseudo-ops:

   2.13.9   Pseudo-ops for fused sequences
    Assembler pseudo-ops for not postfix fusion:
      nand rd, rs1, rs2           ->      and rd, rs1, rs2; not rd, rd
      nor rd, rs1, rs2            ->      or rd, rs1, rs2; not rd, rd

so the question really becomes: why are NOR and NAND included in RVV, but are pseudo-ops in BitManip?

l.

lkcl

unread,
Oct 3, 2019, 11:24:58 AM10/3/19
to RISC-V ISA Dev, rogier....@gmail.com, bruce...@sifive.com, bm1...@bristol.ac.uk
Ternary operations are so ultra-expensive, literally taking up an entire major opcode each (minus 3 bits), that the inclusion of so many of them in BitManip has me concerned.

One potential solution is to use RVC register numbering or similar: 10 bits can hold 3 registers as 3-, 3- and 4-bit fields. If encoding across all 3 existing fields rd, rs1 and rs2, that is 15 bits, which could be allocated as 4, 4, 4 and 3 bits across the four registers.

It is not ideal; however, neither is taking up the entire OP32 space with ternary operations each requiring 4 registers.

A little thought may find an overlap with existing RVC encoding.

The other alternative is a destructive ternary operation which uses one of the source operands as the destination. This is less than ideal too; then again, none of the options are.

Thoughts welcomed as to the benefits of each. Are there any other options?

L.

Matthew Farkas-Dyck

unread,
Oct 3, 2019, 1:46:05 PM10/3/19
to lkcl, RISC-V ISA Dev
On 03/10/2019, lkcl <luke.l...@gmail.com> wrote:
> Thoughts welcomed as to the benefits of each. Are there any other options?

Yes, long instruction encoding

This is one reason to be wary of ternary ops for sure. It's
unfortunate the F/D extension wastes so much opcode space with FMA
while FNMADD, FNMSUB could be done with macro op fusion.

I also note CMOV can be done in terms of CMIX and SLT+NEG, the latter of
which i think may deserve its own opcode, as it seems quite useful in
bit-twiddling code [0] and could be more readily fused. But i'll leave
that to folks with more experience...

[0] For another example, min x y = y ^ ((x ^ y) & -(x < y))
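
A minimal C sketch of that trick, in the int_xlen_t convention used in the review (the -(x < y) term is exactly the SLT-then-NEG pair):

/* branchless min: -(x < y) is all ones when x < y and zero otherwise,
   so the mask keeps either (x ^ y) or nothing before the final xor */
int_xlen_t min_branchless(int_xlen_t x, int_xlen_t y)
{
    int_xlen_t mask = -(int_xlen_t)(x < y);   /* SLT ; NEG */
    return y ^ ((x ^ y) & mask);
}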

lkcl

unread,
Oct 3, 2019, 2:00:35 PM10/3/19
to RISC-V ISA Dev, luke.l...@gmail.com


On Thursday, October 3, 2019 at 6:46:05 PM UTC+1, Matthew Farkas-Dyck wrote:
On 03/10/2019, lkcl <luke.l...@gmail.com> wrote:
> Thoughts welcomed as to the benefits of each. Are there any other options?

Yes, long instruction encoding

oh, duh, good point :)  my only concern with those - there's no official extension that's gone into 32+ bit territory.  therefore the "design" cost of using 48/64/80+ has to be added.

This is one reason to be wary of ternary ops for sure. It's
unfortunate the F/D extension wastes so much opcode space with FMA
while FNMADD, FNMSUB could be done with macro op fusion.

yeh i didn't realise FMA took up an entire major opcode until doing the FP ops allocation page
https://libre-riscv.org/rv_major_opcode_1010011/

evidently the price was considered worth paying.
 

I also note CMOV can be done in terms of CMIX and SLT+NEG, the latter
which i think may deserve its own opcode, as it seems quite useful in
bit-twiddling code [0] and could be more readily fused. But i'll leave
that to folks with more experience...

[0] For another example, min x y = y ^ ((x ^ y) & -(x < y))

oo niiice.  i do love tricks like this :)

l.

Bruce Hoult

unread,
Oct 3, 2019, 4:48:45 PM10/3/19
to lkcl, RISC-V ISA Dev, Rogier Brussee, bm1...@bristol.ac.uk
On Fri, Oct 4, 2019 at 4:24 AM lkcl <luke.l...@gmail.com> wrote:
>
> Ternary operations are so ultra expensive, literally taking up an entire major opcode each (minus 3 bits) that the inclusion of so many of them in BitManip has me concerned.

*Nothing* is included in BitManip at this point. Some people have said
they want certain ternary ops, so they are being *evaluated* to see if
they are worth the very large cost. My answer to that is they are not,
and I believe Clifford thinks the same, at least for general purpose
processors.

Some people in embedded applications may not ever want to use that
opcode space for anything else and therefore be happy to pay that
encoding space cost and also the cost of a 3rd read port (per
execution pipe, for superscalar) on the integer register file. To
support them we should standardize on the opcodes so that the people
who want them can have assembler and compiler support.

It can even be ok if those opcodes overlap some other extension as
long as those extensions are marked as mutually incompatible.

> One potential solution is to use RVC register numbering or similar. 10 bits can be used to store 3 registers 3,3,4. If encoding across all 3 fields rd, rs1 and rs2, that is 15 bits which could be allocated 4, 4, 4, 3

We're trying to avoid that because it ruins one of the nice
implementation properties of the 32-bit instruction set: that register
access can start before you even know what instruction you are looking
at. It also won't help the problem of needing three read ports.

Bruce Hoult

unread,
Oct 3, 2019, 4:51:21 PM10/3/19
to Matthew Farkas-Dyck, lkcl, RISC-V ISA Dev
On Fri, Oct 4, 2019 at 6:46 AM Matthew Farkas-Dyck <stra...@gmail.com> wrote:
> This is one reason to be wary of ternary ops for sure. It's
> unfortunate the F/D extension wastes so much opcode space with FMA
> while FNMADD, FNMSUB could be done with macro op fusion.

The "fused" ops produce slightly different results (and must do so)
than doing the two operations individually.

Bruce Hoult

unread,
Oct 3, 2019, 4:53:19 PM10/3/19
to lkcl, RISC-V ISA Dev
On Fri, Oct 4, 2019 at 7:00 AM lkcl <luke.l...@gmail.com> wrote:
> yeh i didn't realise FMA took up an entire major opcode until doing the FP ops allocation page
> https://libre-riscv.org/rv_major_opcode_1010011/
>
> evidently the price was considered worth paying.

In much or most floating point code, FMA and friends are *the* most
common FP operation.

lkcl

unread,
Oct 3, 2019, 5:11:11 PM10/3/19
to RISC-V ISA Dev, stra...@gmail.com, luke.l...@gmail.com

Oh. Yes. There's something specifically in the OpenCL and Vulkan specs about this. Some less accurate situations permit FMUL and FADD to be used, and there is an OpenCL extended opcode to cover that.
L.

lkcl

unread,
Oct 3, 2019, 5:15:35 PM10/3/19
to RISC-V ISA Dev, luke.l...@gmail.com
On Friday, October 4, 2019 at 4:53:19 AM UTC+8, Bruce Hoult wrote:

> In much or most floating point code, FMA and friends are *the* most
> common FP operation.

Makes sense.

Rogier came up with some nice bitmanip macro op fusion sequences for ternary ops.

Keeping RV clean (no 4,3,3 register field encoding) is sensible.

Need to go through all options for completeness. Miss something, kick ourselves when it's too late.

L.

Matthew Farkas-Dyck

unread,
Oct 3, 2019, 10:26:15 PM10/3/19
to lkcl, Bruce Hoult, RISC-V ISA Dev
On 03/10/2019, Bruce Hoult <bruce...@sifive.com> wrote:
> On Fri, Oct 4, 2019 at 6:46 AM Matthew Farkas-Dyck <stra...@gmail.com> wrote:
>> This is one reason to be wary of ternary ops for sure. It's
>> unfortunate the F/D extension wastes so much opcode space with FMA
>> while FNMADD, FNMSUB could be done with macro op fusion.
>
> The "fused" ops produce slightly different results (and must do so)
> than doing the two operations individually.

To clarify: i meant the following:
FNMADD x, a, b = FMADD x, a, b; FNEG x, x
FNMSUB x, a, b = FMSUB x, a, b; FNEG x, x
which should be exactly equivalent, to my knowledge.

On 03/10/2019, lkcl <luke.l...@gmail.com> wrote:
> Rogier came up with some nice bitmanip macro op fusion sequences for ternary
> ops.

Can it be done for CMIX or funnel shift? That would certainly be
another strike against them. (I would not consider not having a
fusible sequence a strong enough point in their favor to include them
in the core B extension, however, given the burden they impose on the
microarchitecture and the ISA.)

lkcl

unread,
Oct 4, 2019, 12:58:27 AM10/4/19
to RISC-V ISA Dev, luke.l...@gmail.com, bruce...@sifive.com
On Friday, October 4, 2019 at 10:26:15 AM UTC+8, Matthew Farkas-Dyck wrote:

> To clarify: i meant the following:
> FNMADD x, a, b = FMADD x, a, b; FNEG x, x
> FNMSUB x, a, b = FMSUB x, a, b; FNEG x, x
> which should be exactly equivalent, to my knowledge.

On the face of it, it seems reasonable; the problem is that the NaN interaction, rounding and so on produce different answers. John Hauser's work is the best reference here, in hardfloat and softfloat.

hardfloat-verilog is around 250 lines of the most stunningly densely packed and yet still readable logic I have ever seen (most verilog that dense is flat out unreadable).

The core of the issue is that the intermediate result between the MUL and the ADD loses precision: rounding is performed *twice*.

Anyway.

OT :)

Rogier knows the macro ops, will leave it to him to reply.

L.

Bruce Hoult

unread,
Oct 4, 2019, 1:12:19 AM10/4/19
to lkcl, RISC-V ISA Dev
+0.0 vs -0.0 can also be tricky to get right.

I'm not actually sure whether this particular example with FNEG is ok or not.

Jacob Lifshay

unread,
Oct 4, 2019, 1:26:41 AM10/4/19
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Bruce Hoult
the suggested macro-op fusion fuses a muladd with a neg, not a mul with an add.

it will round differently:
suppose the rounding mode is round towards +inf:
the muladd will round towards +inf, then the neg will reverse the effective rounding direction which will produce the effect of rounding towards -inf.

none of the rounding modes work in all cases due to how signed zeros are added.

Jacob

Andrew Waterman

unread,
Oct 4, 2019, 12:11:58 PM10/4/19
to Bruce Hoult, RISC-V ISA Dev, lkcl
No, alas, you need to negate one of the inputs instead of the output. In that case, fusion is only easy if you can destroy one of the inputs.




Rogier Brussee

unread,
Oct 4, 2019, 12:19:24 PM10/4/19
to RISC-V ISA Dev, luke.l...@gmail.com, rogier....@gmail.com, bm1...@bristol.ac.uk


On Thursday, October 3, 2019 at 22:48:45 UTC+2, Bruce Hoult wrote:
On Fri, Oct 4, 2019 at 4:24 AM lkcl <luke.l...@gmail.com> wrote:
>
> Ternary operations are so ultra expensive, literally taking up an entire major opcode each (minus 3 bits) that the inclusion of so many of them in BitManip has me concerned.

*Nothing* is included in BitManip at this point. Some people have said
they want certain ternary ops, so they are being *evaluated* to see if
they are worth the very large cost. My answer to that is they are not,
and I believe Clifford thinks the same, at least for general purpose
processors.

Some people in embedded applications may not ever want to use that
opcode space for anything else and therefore be happy to pay that
encoding space cost and also the cost of a 3rd read port (per
execution pipe, for superscalar) on the integer register file. To
support them we should standardize on the opcodes so that the people
who want them can have assembler and compiler support.

It can even be ok if those opcodes overlap some other extension as
long as those extensions are marked as mutually incompatible.


The suggestions for replacements of ternary operations that Luke mentioned:

They are in the review document, but to wit: 

MIX can be done with three instructions:

MIX rd rs1 rs2 rs3  # rd = rs1&rs2 | (~rs1)&rs3 == rs1&(rs2 ^ rs3) ^ rs3
{
    # assume rd != rs1, rs3
     XOR rd, rs2, rs3
     AND rd, rd, rs1
     XOR rd, rd, rs3
}

which can be fused, taking care of the assumptions. 

The compiler has freedom to avoid the assumptions

MIX2 rd rs1 rs2 rs3 # rd = rs1&rs2 | (~rs1)&rs3 == (~rs1)&(rs2 ^ rs3) ^ rs2
{
    #assume rd != rs1, rs2
     XOR  rd, rs2, rs3
     ANDN rd, rd, rs1
     XOR  rd, rd, rs2
}

or even 

MIX3 rd rs1 rs2 rs3 # rd = rs1&rs2 | (~rs1)&rs3 == rs1&(rs2 ^ rs3) ^ rs3
{
    #assume rd == rs1, rd != rs2, rs3
    XOR rs2, rs2, rs3
    AND rd, rs1, rs2    #rd == rs1
    XOR rd, rd, rs3
    XOR rs2, rs2, rs3   #restore rs2
}

Various degenerate cases such as rs1 == rs2 just reduce to OR, AND and MV, so one can (if one wants) define a
macro that covers all cases.
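As a cross-check, here is a C sketch of that identity in the style of the definitions at the top of this review (function names are mine; uint_xlen_t is instantiated as uint64_t for RV64 purely as an example):

#include <stdint.h>

typedef uint64_t uint_xlen_t;                 /* example instantiation for RV64 */

uint_xlen_t mix_reference(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
    return (rs1 & rs2) | (~rs1 & rs3);        /* the defining expression for MIX */
}

uint_xlen_t mix_xor_and_xor(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
    uint_xlen_t rd = rs2 ^ rs3;               /* XOR rd, rs2, rs3 */
    rd &= rs1;                                /* AND rd, rd, rs1  */
    return rd ^ rs3;                          /* XOR rd, rd, rs3  */
}

The two functions agree for all inputs, which is exactly the identity rs1&(rs2 ^ rs3) ^ rs3 used above.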


Likewise, for CMOV we can define special 2-input, 1-output register versions of CMOV.

MIN[U], MAX[U] cover lots of cases (and are really just cheap variations of SLT and SLTU, see the Verilog for the PULPINO processor).   


There is the special case of the Elvis operator  ?:

ELVIS rd, rs1, rs2 : rd = rs1 ?: rs2 == (rs1) ? rs1 : rs2 == (rs1 ? (rs1 ^ rs2) : 0) ^ rs2 == ((-!!rs1) & (rs1 ^ rs2)) ^ rs2 == rs1 | ((-!rs1) & rs2)

This is actually an operator in GNU C/C++ and other programming languages, and it
covers cases like replacing a NULL value by something else.  Note that implementation-wise, it is just a cheap R-type ALU instruction.
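A C sketch of ELVIS in the same style (again my own naming, reusing the uint_xlen_t convention from the sketch above):

uint_xlen_t elvis(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return rs1 ? rs1 : rs2;                            /* rs1 ?: rs2 */
}

uint_xlen_t elvis_branchfree(uint_xlen_t rs1, uint_xlen_t rs2)
{
    uint_xlen_t mask = (uint_xlen_t)0 - (rs1 != 0);    /* -!!rs1: all ones iff rs1 != 0 */
    return (mask & (rs1 ^ rs2)) ^ rs2;
}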


There is the special case of CMOV with rs3 == zero 

AND.NE rd rs1 rs2 # rd = rs1 ? rs2 : 0 == (-!!rs1) & rs2

(alternative name: CMOVZ (conditional move or zero), but it is just a bit operation; it also implements logical AND and turns a 0/1 value into a 0/bitmask.
One can of course apply the (-!!) to rs2 instead of rs1 without changing anything important.)
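In the same C style (the helper name andne() is mine), AND.NE is just:

uint_xlen_t andne(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return rs1 ? rs2 : 0;            /* AND.NE: (-!!rs1) & rs2 */
}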
 

Besides being of interest in its own right, it can be used to emulate the general case with

CMOV1 rd rs1 rs2 rs3 # rd = rs1 ? rs2 : rs3 == (rs1 ? (rs2 ^ rs3) : 0) ^ rs3, assume rd != rs1, rs3
{
     XOR    rd, rs2, rs3
     AND.NE rd, rs1, rd
     XOR    rd, rd, rs3
}


Again, this is a fusable sequence if you are careful about the assumptions.
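In C, the same three-instruction sequence, reusing the hypothetical andne() sketch from above:

uint_xlen_t cmov_via_andne(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
    uint_xlen_t rd = rs2 ^ rs3;      /* XOR    rd, rs2, rs3 */
    rd = andne(rs1, rd);             /* AND.NE rd, rs1, rd  */
    return rd ^ rs3;                 /* XOR    rd, rd, rs3  */
}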

(with an additional instruction

 AND.EQ rd rs1 rs2 # rd = (!rs1)?rs2: 0 == (-!rs1)& rs2 == (~ -!!rs1)& rs2

you could do something similar to MIX2 but I am not sure it is worth it. Even without that you can do the analogue of MIX3.)



If you really really want to,  with some care, you can define a macro that does all the special cases of CMOV
(IIRC this requires ELVIS if you want to do it without an extra register)


For the funnel shifts I think the important case is the immediate case. This can be done in 3 instructions but 
not without an extra register so the sequence is not technically fusable. 

FSLI rd rs1 rs2 imm; t # assume t != rd, rs1 and rd != rs2
{
      if ((imm & (2*XLEN - 1)) < XLEN){   # equivalently, if ((imm & XLEN) == 0)
            SLLI rd, rs1, imm & (XLEN - 1)
            SRLI t,  rs2, (-imm) & (XLEN - 1)
            OR   rd, rd, t
       }else{
            SLLI t,  rs2, imm & (XLEN - 1)
            SRLI rd, rs1, (-imm) & (XLEN - 1)
            OR   rd, rd, t
      }
      # (imm a multiple of XLEN degenerates to a plain move and is best special-cased)
}
 
with the (very rare) non-immediate versions just taking the branch (with the test checking the bit of the shift amount with value XLEN) and doing the negation
(or having an SRLNEG, but Jacob convinced me they are probably not worth it).
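For reference, a C sketch of the funnel-shift-left semantics assumed above (shift amount reduced mod 2*XLEN, operand order as in the macro, not necessarily as in the draft spec; the shamt == 0 case is split out to avoid a C-level shift by XLEN):

#define XLEN 64                               /* example instantiation; the review uses XLEN symbolically */

uint_xlen_t fsl_reference(uint_xlen_t rs1, uint_xlen_t rs2, unsigned imm)
{
    unsigned shamt = imm % (2 * XLEN);
    if (shamt >= XLEN) {                      /* the "else" branch of the macro above */
        uint_xlen_t tmp = rs1;
        rs1 = rs2;
        rs2 = tmp;
        shamt -= XLEN;
    }
    return shamt ? (rs1 << shamt) | (rs2 >> (XLEN - shamt)) : rs1;
}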

Rogier Brussee

unread,
Oct 4, 2019, 2:19:12 PM10/4/19
to RISC-V ISA Dev, luke.l...@gmail.com, rogier....@gmail.com, bm1...@bristol.ac.uk
[snip]

While it is nice to have a fusable sequence, I can well imagine that if the sequence is not fused, 
it is actually faster to use a sequence that uses one temporary register and write

CMOV_MACRO rd rs1 rs2 rs3; t # rd = rs1 ? rs2 : rs3, assume t != rd, rs1, rs3
{
      AND.NE t,  rs1, rs2
      AND.EQ rd, rs1, rs3
      OR     rd, rd, t         # can also use C.ADD
}

This is because the first two instructions can now be scheduled simultaneously so 
we are down to 3 instructions (10 bytes) and two cycles. 
Unfortunately whether this is a performance win depends on the platform. 
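In C, with the hypothetical andne() from earlier plus an analogous andeq() for AND.EQ, the OR-of-two-halves decomposition is:

uint_xlen_t andeq(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return rs1 ? 0 : rs2;                     /* AND.EQ: (-!rs1) & rs2 */
}

uint_xlen_t cmov_two_halves(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
    uint_xlen_t t  = andne(rs1, rs2);         /* AND.NE t,  rs1, rs2 */
    uint_xlen_t rd = andeq(rs1, rs3);         /* AND.EQ rd, rs1, rs3 */
    return rd | t;                            /* OR     rd, rd, t    */
}

The two AND.* results are independent of each other, which is what allows the first two instructions to issue in the same cycle.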

Mutatis mutandis the same holds for  MIX.
 
[snip]

Ciao

Rogier.

lkcl

unread,
Oct 4, 2019, 2:52:50 PM10/4/19
to RISC-V ISA Dev, rogier....@gmail.com, bruce...@sifive.com, bm1...@bristol.ac.uk
Rogier, the AND.NE etc make a lot of sense as macro op fused overlays to construct ternary ops. Two different CMOV sequences, both with different assumptions, could be used under mutually alternate circumstances.

Intuitively I can foresee that in a bitlevel fashion the same tricks using ANDN etc could be used to do ternary bit selection i.e. bit mixing from 2 sources.

A hypothetical NAND and NOR on the other hand, these are symmetrical and consequently lack the same "clout".

What *can* they be used for which AND/OR followed by NOT cannot achieve?  Any loops where 3 or 4 instructions is a heavy penalty?

L.

lkcl

unread,
Oct 5, 2019, 3:15:52 AM10/5/19
to RISC-V ISA Dev, luke.l...@gmail.com


On Friday, October 4, 2019 at 5:15:35 AM UTC+8, lkcl wrote:

Makes sense.

Rogier came up with some nice bitmanip macro op fusion sequences for ternary ops.

Keeping RV clean (no 4,3,3) sensible.

Need to go through all options for completeness. Miss something, kick ourselves when it's too late.


Apologies for terse and obtuse language there, it was very late (4am or so).

 Another option, just for completeness: destructive overwrite of one of the src operands, reducing the total regs needed down to three.

This trick is used in RVV FMAC as there is not enough space for 4 operands in the Vector OP32 format.

In the MUX operand case, a logical choice for the register to be overwritten would perhaps be the selector; it could be argued that overwriting the two src regs would be undesirable.

However, if a structure involving multiple registers is to be copied, destroying the selector is clearly bad.

No good options here basically.

Rogier Brussee

unread,
Oct 5, 2019, 5:42:43 AM10/5/19
to RISC-V ISA Dev, rogier....@gmail.com, bruce...@sifive.com, bm1...@bristol.ac.uk


On Friday, October 4, 2019 at 20:52:50 UTC+2, lkcl wrote:
Rogier, the AND.NE etc make a lot of sense as macro op fused overlays to construct ternary ops. 
Two different CMOV sequences, both with different assumptions, could be used under mutually alternate circumstances.


That was the point I was trying to make for the analogous MIX case. For two different usable CMOV sequences, or for the parallel-schedulable/extra-register version,
you need AND.NE and AND.EQ.

If I understand correctly, the point of the CMOV instruction is to reduce the number of branches and in particular make better use of the branch prediction hardware
by not bothering it with something that emulates a CMOV. With AND.NE/AND.EQ and XOR or OR/temporary you can achieve the same goal without needing
an architecturally expensive 3-in 1-out instruction, although at the cost of slightly more instructions, unless you can use AND.NE (or AND.EQ) or ELVIS directly.

 
Intuitively I can foresee that in a bitlevel fashion the same tricks using ANDN etc could be used to do ternary bit selection i.e. bit mixing from 2 sources.


See the macros for MIX I wrote above, or do you mean

MIX_MACRO_WITHTMP rd rs1 rs2 rs3; t  # rd = rs1&rs2 | (~rs1)&rs3, assume t != rd, rs1, rs2
{
      ANDN t, rs3, rs1       # t = rs3 & (~rs1), since ANDN x, y, z computes y & ~z
      AND  rd, rs1, rs2
      OR   rd, rd, t
}

I must admit I cannot really see the case for MIX. 
 
A hypothetical NAND and NOR on the other hand, these are symmetrical and consequently lack the same "clout".

What *can* they be used for which AND/OR followed by NOT cannot achieve?  Any loops where 3 or 4 instructions is a heavy penalty?


Not quite sure what you mean with "they" here. 

For NAND and NOR, they just avoid the need for a two-instruction sequence, bring Bitmanip in sync with RVV and
x86, and finally reduce the need for C.NOT, but at the cost of two extra instructions and an inverter for rs1 similar to the existing one on rs2.

For AND.NE and AND.EQ the important thing is to be cheaper than the alternative 

CMOV_WITHBRANCH rd rs1 rs2 rs3 # rd = rs1 ? rs2 : rs3
{
    C.MV   rd, rs3
    C.BEQZ rs1, +4      # skip the next (2-byte) instruction if rs1 == 0
    C.MV   rd, rs2
}

when _not_ fused. This sequence is shorter in bytes but it branches, although people seem to
be experimenting with fusing the last two instructions (one-instruction branch + instruction)
into a conditional instruction.  People more qualified than me can comment
whether that is really easy or creates its own problems.  At least AND.NE and AND.EQ (and ELVIS) are just
simple ALU instructions and are a win if the whole CMOV can be replaced with one of them.

Rogier
 
L.

lkcl

unread,
Oct 5, 2019, 6:44:45 AM10/5/19
to RISC-V ISA Dev, rogier....@gmail.com, bruce...@sifive.com, bm1...@bristol.ac.uk
On Saturday, October 5, 2019 at 5:42:43 PM UTC+8, Rogier Brussee wrote:

> If I understand correctly, the point of the CMOV instruction is to reduce the number of branches and in particular make better use of the branch prediction hardware
> by not bothering it with something that emulates a CMOV.

*emulates* a CMOV. now that is very interesting, because the same logic may apply to AND, NAND, and other decision-making based on branches and so on.

In parallel architectures you *have* to use nonbranching decisions. It never occurred to me before that scalar operations would have a cost as well.

> With AND.NE/AND.EQ and XOR or OR/temporary you can achieve the same goal without needing
> an architecturally expensive 3-in 1-out instruction, although at the cost of slightly more instructions, unless you can use AND.NE (or AND.EQ) or ELVIS directly.

uhhhuhn [1]

> Not quite sure what you mean with "they" here. 

NAND and NOR.

>
> For NAND and NOR, they just avoid the need for a two-instruction sequence, bring Bitmanip in sync with RVV and
> x86, and finally reduce the need for C.NOT, but at the cost of two extra instructions and an inverter for rs1 similar to the existing one on rs2.

Doesn't sound so bad. Btw found the x86 opcode:

https://www.felixcloutier.com/x86/vptestnmb:vptestnmw:vptestnmd:vptestnmq

> For AND.NE and AND.EQ the important thing is to be cheaper than the alternative 
>
>
> CMOV_WITHBRANCH rd rs1 rs2 rs3 # rd = rs1 ? rs2 : rs3
> {
>     C.MV   rd, rs3
>     C.BEQZ rs1, +4      # skip the next (2-byte) instruction if rs1 == 0
>     C.MV   rd, rs2
> }

That would be costly in power terms, triggering branch prediction hardware.

> whether that is really easy or creates its own problems.  At least AND.NE and AND.EQ (and ELVIS) are just
> simple ALU instructions and are a win if the whole CMOV can be replaced with one of them.

yehyeh.

L.

[1] Elvis. Get it? No? I'm all shook up you didn't get it.... [2]
[2] sorry... couldn't resist :)

Janos Vegh

unread,
Oct 5, 2019, 3:40:46 PM10/5/19
to RISC-V ISA Dev
Some projects plan to base multiprocessor systems (including supercomputers) on RISC-V.
I wonder what is the effect of extensions like atomic operations (as well as vector operations, etc.)
on the pipeline in multiprocessor (and/or multithread) environments.
The machine-instruction-level solution to provide atomic operations can be acceptable
when coordinating the operation of just two threads (and/or processors), but can be tragic in the
case of a high-performance (parallelized sequential) environment.

Regards
Janos


lkcl

unread,
Oct 5, 2019, 4:00:15 PM10/5/19
to RISC-V ISA Dev


On Sunday, October 6, 2019 at 3:40:46 AM UTC+8, Janos Vegh wrote:
Some projects plan to base multiprocessor systems (including supercomputers) on RISC-V.
I wonder what is the effect of extensions like atomic operations (as well as vector operations, etc.)
on the pipeline in multiprocessor (and/or multithread) environments.
The machine-instruction-level solution to provide atomic operations can be acceptable
when coordinating the operation of just two threads (and/or processors), but can be tragic in the
case of a high-performance (parallelized sequential) environment.

That's a really good question, so I changed the subject line, hope I summarised it correctly.

L.

Regards
Janos


Allen Baum

unread,
Oct 7, 2019, 7:14:37 PM10/7/19
to lkcl, RISC-V ISA Dev, Rogier Brussee, Bruce Hoult, Ben Marshall
This has been mentioned before, but Alpha had a cmov rd = cond?rs1:rs2, but only implemented a microarchitecture with 2 inputs.
And since it was out of order, it renamed all destinations.
Their implementation cracked the op into
      cmovT rd = cond ? src1 : rd          but really if(cond)  {rd = src1}
      cmovF rd = cond ? rd : src2          but really if(!cond) {rd = src2}
so it didn't need to fetch rd at all, and special-cased renaming so it renamed rd only once.


Andrew Waterman

unread,
Oct 7, 2019, 8:24:15 PM10/7/19
to Allen Baum, lkcl, RISC-V ISA Dev, Rogier Brussee, Bruce Hoult, Ben Marshall
On Mon, Oct 7, 2019 at 4:14 PM Allen Baum <allen...@esperantotech.com> wrote:
This has been mentioned before, but Alpha had a cmov rd = cond?rs1:rs2, but only implemented a microarchitecture with 2 inputs.
And since it was out of order, it renamed all destinations.
Their implementation cracked the op into
      cmovT rd = cond ? src1 : rd          but really if(cond)  {rd = src1}
      cmovF rd = cond ? rd : src2          but really if(!cond) {rd = src2}
so it didn't need to fetch rd at all, and special-cased renaming so it renamed rd only once.

If you are referring to the scheme used in the Alpha 21264, it had the additional cost of widening the physical registers by one bit to hold the predicate value produced by the first uop.  The scheme is described on page 28 of https://www.cis.upenn.edu/~milom/cis501-Fall09/papers/Alpha21264.pdf

The scheme you describe would make the bypass control logic and possibly wakeup logic dependent on the data value of cond (which of the two uops do you bypass from?).

Allen Baum

unread,
Oct 8, 2019, 3:45:27 PM10/8/19
to Andrew Waterman, lkcl, RISC-V ISA Dev, Rogier Brussee, Bruce Hoult, Ben Marshall
Yea, that is what I was alluding to with the actual renaming only happening once.
The article says a result is placed into an "internal register" with a 65th bit - which seems to imply that it isn't in the physical register file.
That sounds cheaper than having all physical registers have a 65th bit - but it does require that no other conditional move be scheduled between the two micro-ops.
Still, that may ultimately be easier than preventing a second conditional move from starting.

So the actual implementation appears to be a bit different than I wrote above:
      cmov1 (cond, src1) { spcl.cond = cond;  spcl.data = src1 }        // reads cond and src1, writes spcl
      cmov2 (rdst,  src2) { rdst = spcl.cond ? spcl.data : src2 }       // reads spcl and src2, writes rdst
By using the dedicated special reg, only 2 registers ever need to be read in a cycle.
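A small C model of that cracking (names and types are mine, just to make the dataflow explicit): the dedicated special register carries the condition bit and the intermediate data between the two uops, so each uop reads at most two sources.

#include <stdint.h>

struct spcl_reg { int cond; uint64_t data; };
static struct spcl_reg spcl;                     /* the dedicated internal register */

void cmov1(uint64_t cond, uint64_t src1)         /* reads cond and src1, writes spcl */
{
    spcl.cond = (cond != 0);
    spcl.data = src1;
}

uint64_t cmov2(uint64_t src2)                    /* reads spcl and src2, result goes to rdst */
{
    return spcl.cond ? spcl.data : src2;
}

/* usage, modelling rdst = cond ? src1 : src2:
   cmov1(cond, src1);
   rdst = cmov2(src2);                           */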