<this is a copy-paste of the attached file, which may be slightly better formatted>
Review of Xbitmanip V0.9
Rogier Brussee
2019-09-03
Disclaimer:
Views are strictly my own; I have no particular expertise in this area, but I have been following the discussions for quite a while. Proposals are gratuitous, not based on quantitative analysis other than "you propose this, but it can be done with one or two instructions shaved off using this seemingly simpler instruction". Finally, review is not criticism, YMMV.
General remarks:
Great that the B extension is moving again and making progress. Special thanks to Clifford for pushing this for a long time; it must have been an awful lot of work.
See below for general remarks on the B extension and other bit-related Z extensions.
For many sections it is not quite clear what the W versions (for RV64) or D versions (for RV128) are supposed to do. I will try to indicate how I understand they work.
It would be useful if a rationale for each instruction were given and quantitative data on its use were available.
General remarks about this review
I try to use uppercase for instructions in the text to make them stand out and lower case in definitions, but there is no semantic difference.
Many sections contain C code as a way to define what the instruction is supposed to be doing. I like this because it is usually much easier to follow than words. However, I think there is value in making the definition as conceptually simple as possible even if it is (much) slower, and I will suggest some. Can we move faster but less intuitive implementations to an appendix? The simplest versions will usually be "bit by bit" algorithms, for which I use the following macros and functions:
* I use LEN which can be #defined to XLEN to get the regular version, or to 32 or 64 for the W and D versions. One could perhaps use C++ template notation like sext<LEN>() instead.
* In C code definitions involving shifts I prefer to write something like rs1 << (rs2 % LEN) wherever a shift is made, to emphasise that shifts in RV are mod XLEN, 32 (for W) or 64 (for D). This also makes it slightly clearer that it makes sense to shift over negative numbers.
As a result I avoid definitions like shamt = rs2 & (LEN - 1), which, of course, is equivalent but conceptually less clear.
* If the order of a loop is not important I use this macro
/* for loop that does not care about order: fake it by running backwards */
#define FOREACH(I, BEGIN, END) for (I = (END) - 1; I >= (BEGIN); --I)
* I use functions that do sign and zero extensions acting on the full XLEN bit register. Hardware is supposed to have better/cheaper ways to do this than the C code below.
int_xlen_t sext(int_xlen_t rs1, unsigned imm)
{
    return (rs1 << ((-imm) % XLEN)) >> ((-imm) % XLEN);
}
uint_xlen_t zext(uint_xlen_t rs1, unsigned imm)
{
    return (rs1 << ((-imm) % XLEN)) >> ((-imm) % XLEN);
}
* I use single bit functions acting on the full XLEN bit register, to set, get, clear and invert bits (named after some instructions in the proposal)
uint_xlen_t bget1(uint_xlen_t rs1, unsigned imm)
{
    return rs1 & ((uint_xlen_t)1 << (imm % XLEN));
}
uint_xlen_t bset1(uint_xlen_t rs1, unsigned imm)
{
    return rs1 | ((uint_xlen_t)1 << (imm % XLEN));
}
uint_xlen_t bclr1(uint_xlen_t rs1, unsigned imm)
{
    return rs1 & ~((uint_xlen_t)1 << (imm % XLEN));
}
uint_xlen_t binv1(uint_xlen_t rs1, unsigned imm)
{
    return rs1 ^ ((uint_xlen_t)1 << (imm % XLEN));
}
uint_xlen_t sbget1(uint_xlen_t rs1, unsigned imm)
{
    return !!bget1(rs1, imm);
}
=======
Page 1 Chapter 1.
Comment: I would propose a further criterion:
Support compilers implementing the standard C-like language basic data types (pointers, _Bool/bool, char, [un]signed char, [unsigned] short, [unsigned] int, [unsigned] long long) and their basic operations without surprises regarding performance and/or code size, in particular for software written suboptimally or not written for RISC-V, e.g. using unsigned or short integers.
Page 4. Section 2.1. Basic Manipulation instructions
Comment: It seems to me that the B extension has moved towards "integer instructions that extend or optimise the basic I instructions". I think this is a good thing, and I would argue that the B extension should be positioned as slightly cheaper than the M extension: useful for basic compiler support, standard applications like bit banging in device drivers, networking, and very basic algorithms like hashing (see also the remark on supporting the C/C++ language). Z extensions can then cater for more specialised purposes.
Page 4: section 2.1.1 CLZ/CTZ
Alternative implementations:
uint_xlen_t clz(uint_xlen_t rs1)
{
    uint_xlen_t rd = 0;
    for (int i = LEN - 1; i >= 0 && !bget1(rs1, i); --i)
        ++rd;
    return rd;
}
uint_xlen_t ctz(uint_xlen_t rs1)
{
    uint_xlen_t rd = 0;
    for (int i = 0; i < LEN && !bget1(rs1, i); ++i)
        ++rd;
    return rd;
}
Please add:
For the W versions set LEN = 32.
Please add
CLZ and CTZ are used for finding the first set element of a bitmap; in particular, this is used in the implementation of malloc. CLZ is also used for soft float.
Page 6 section 2.1.2 PCNT
Alternative implementation
uint_xlen_t pcnt(uint_xlen_t rs1)
{
uint_xlen_t rd = 0;
int i;
FOREACH(i , 0, LEN)
rd += sbget1(rs1, i);
return rd;
}
Please add:
For the W version set LEN = 32
Comment
Maybe it is useful to have a parity instruction even if this can be implemented with pcnt and &1, because it should use less power.
Proposal:
We have ctz(n) == pcnt(~n & (n - 1)) == pcnt(~(n | (-n))), so a two-register instruction
cntz rd rs1 rs2 : rd = popcount(~(rs1 | (-rs2)))
would allow pcnt, ctz and clz to be implemented as 3 macros of at most two instructions instead of 3 instructions each:
pcnt rd rs1 : addi rd rs1 1; cntz rd zero rd
ctz rd rs1 : cntz rd rs1 rs1
clz rd rs1 : brev rd rs1; cntz rd rd rd
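In C, a sketch of the proposed cntz (the name and the code are mine, not from the spec), reusing the pcnt function above:
uint_xlen_t cntz(uint_xlen_t rs1, uint_xlen_t rs2)
{
    /* note: cntz(x, x) == ctz(x) and cntz(zero, x + 1) == pcnt(x) */
    return pcnt(~(rs1 | (-rs2)));
}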
Page 5 section 2.1.3. Logic with negate (andn, orn, xnor)
Comment: I like that all basic logical operations have a “n” version.
Are NAND and NOR not also worth it (the RVV mask instructions seem to have them, for example)?
Proposal:
I would propose the following additional logic instructions
#and rd rs1 rs2 # rd = rs1 & rs2
andn rd rs1 rs2 # rd = rs1 &~ rs2
nand rd rs1 rs2 # rd = ~(rs1 & rs2)
orn rd rs1 rs2 # rd = ~(rs1 &~ rs2) = ~rs1 | rs2
and.ne rd rs1 rs2 # rd = (-!!rs1) & rs2 == (!!rs1) ? rs2 : 0
andn.ne rd rs1 rs2 # rd = (-!!rs1) &~ rs2 == (!!rs1) ? ~rs2 : 0
orn.eq rd rs1 rs2 # rd = ~((-!!rs1) & rs2) == (-!rs1) |~ rs2 == (!!rs1) ? ~rs2 : (-1)
or.eq rd rs1 rs2 # rd = ~((-!!rs1) &~ rs2) == (-!rs1) | rs2 == (!!rs1) ? rs2 : (-1)
#or rd rs1 rs2 # rd = rs1 | rs2
nor rd rs1 rs2 # rd = ~(rs1 | rs2)
#xor rd rs1 rs2 # rd = rs1 ^ rs2
xnor rd rs1 rs2 # rd = ~(rs1 ^ rs2) == rs1 ^ rs2 ^ (-1) == rs1 ^ ~ rs2
This would make AND a 3-bit family (i.e. a basic instruction with some "option flags" or small parameters, here nothing/~ on rs2, nothing/-!! on rs1, nothing/~ on rd) and OR and XOR 1-bit families of instructions (nothing/~ on rd) that can be placed with the AND, OR and XOR instructions themselves (i.e. the instructions ANDN, NAND and AND.NE are supposed to have the same funct3 value as AND, being "AND"-like, but differ in a few bits in the funct7 space).
The ORN instruction with the inversion on rs1 would allow
C.not rd -> orn rd rd zero
to be more regular than XOR rd rs -1 or an ORN with a ~ on rs2, but with ANDN, NAND, ORN, NOR and XNOR, a C.NOT instruction may not be such a high-priority instruction anymore.
The following macros implement the boolean (i.e. 0/1 valued) operators in two instructions
sand rd rs1 rs2 : and.ne rd rs1 rs2 ; neqz rd rd # rd = !!((-!!rs1) & rs2) == !!rs1 & !!rs2
snand rd rs1 rs2 : and.ne rd rs1 rs2 ; eqz rd rd # rd = !((-!!rs1) & rs2) == !rs1 | !rs2
sorn rd rs1 rs2 : or.eq rd rs1 rs2 ; neqz rd rd # rd = !!((-!rs1)| rs2) == !rs1 | !!rs2
sandn rd rs1 rs2 : or.eq rd rs1 rs2 ; eqz rd rd # rd = !((-!rs1)| rs2) == !!rs1 & !rs2
sor rd rs1 rs2 : or rd rs1 rs2 ; neqz rd rd #rd = !!(rs1 | rs2) == !!rs1 | !!rs2
snor rd rs1 rs2: or rd rs1 rs2 ; eqz rd rd #rd = !(rs1 | rs2) == !rs1 & !rs2
sneq rd rs1 rs2: xor rd rs1 rs2 ; neqz rd rd #rd = !!(rs1 ^ rs2) == rs1 != rs2
seq rd rs1 rs2: xor rd rs1 rs2 ; eqz rd rd #rd = !(rs1 ^ rs2) == rs1 == rs2
In the vast majority of cases the NEQZ and EQZ instructions will be taken care of by the branch or the AND/OR.NE/EQ instructions.
The macros
mneqz rd rs1 : andn.ne rd rs1 zero
meqz rd rs1 : or.eq rd rs1 zero
give a logic value as 0/(-1) instead of 0/1 which is useful in combination with masking.
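For concreteness, a minimal C sketch of the proposed AND.NE and OR.EQ semantics (function names and code are mine):
uint_xlen_t andne(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return rs1 ? rs2 : 0;                 /* == (-!!rs1) & rs2 */
}
uint_xlen_t oreq(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return rs1 ? rs2 : (uint_xlen_t)-1;   /* == (-!rs1) | rs2 */
}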
Like the other logical instructions these instructions preserve but don’t establish the invariant of a word being sign extended. No W versions are proposed.
Page 6 section 2.1.4 pack
uint_xlen_t pack(uint_xlen_t rs1, uint_xlen_t rs2)
{
return zext(rs1, LEN/2) | (zext(rs2, LEN/2) << LEN/2);
}
Where LEN = 32 for the W version.
Please add:
The pack instruction can be used to pack "struct" data that fits in an XLEN-sized register and is passed in registers.
Proposal:
I really like the pack instructions. However, I would propose to have 4 XLEN-independent pack instructions:
pack.bu rd rs1 rs2 # rd = zext(rs1, 8) | ( zext(rs2, 8) << 8 )
pack.hu rd rs1 rs2 # rd = zext(rs1, 16) | (zext(rs2, 16) << 16)
pack.wu rd rs1 rs2 # rd = zext(rs1, 32) | (zext(rs2, 32) << 32) //(RV64)
pack.du rd rs1 rs2 # rd = zext(rs1, 64) | (zext(rs2, 64) << 64) //(RV64, RV128)
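As a C sketch (using the zext helper from the introduction; this is my code, not spec text), pack.hu would be:
uint_xlen_t pack_hu(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return zext(rs1, 16) | (zext(rs2, 16) << 16);
}
pack.bu and pack.wu are the same with 8 and 32 substituted for 16.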
Page 7.
Comment.
One should not need to use shfl instructions to pack bytes. See proposal for pack.bu.
The ZEXT pseudo instruction could also use ADDWU. Also this is (as far as I can see) the only mention of RV128. It should perhaps be mentioned everywhere (preferably) or nowhere.
Page 7 section 2.1.5 Min/Max
Proposal:
I would propose to also add SGE and SGEU, as is done in RI5CY/PULP. These would be nicely placed with the SLT and SLTU instructions (i.e. have the same funct3 bits as SLT and SLTU, see the proposal in section 2.1.3), making each a 2-bit family:
//slt rd rs1 rs2 # rd = rs1 <_s rs2
sge rd rs1 rs2 # rd = rs1 >=_s rs2
min rd rs1 rs2 # rd = rs1 <_s rs2 ? rs1 : rs2
max rd rs1 rs2 # rd = rs1 >=_s rs2? rs1 : rs2
//sltu rd rs1 rs2 # rd = rs1 <_u rs2
sgeu rd rs1 rs2 # rd = rs1 >=_u rs2
minu rd rs1 rs2 # rd = rs1 <_u rs2 ? rs1 : rs2
maxu rd rs1 rs2 # rd = rs1 >=_u rs2? rs1 : rs2
Perhaps this would make it beneficial to define
//sneqz rd rs1 : sltu rd zero rs1
seqz rd rs1 : sgeu rd zero rs1 # instead of sltiu rd rs1 1
Page 8
Taking the absolute value of an integer is a generally useful operation and I think it is not needed to mention SAT solvers.
Page 8. Section 2.1.6 sbset/sbclr/sbinv/sbext
Please add:
These instructions efficiently support bitfields, and avoid costly constant constructions and nasty surprises especially if the bitfields are not in low bits.
Bikeshed.
I think the s prefix should be reserved for instructions returning booleans (0/1), which means only SBEXT is properly named. I would suggest BSET1, BCLR1, BINV1, and SBEXT1 or SBGET1 (but see below)
I would argue that a BGET1 instruction
uint_xlen_t bget1(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return rs1 & ((uint_xlen_t)1 << (rs2 % LEN));
}
is more useful and cheaper than an SBGET1 instruction.
Page 9 Section 2.1.7 SLO/SRO
Comment:
I can see that these instructions are cheap on the ALU. However, I still don't really see much rationale for them, and each uses a precious 7-bit immediate.
The main application seems to be to construct a mask of imm ones, i.e. (1 << imm) - 1, but the sequence
li rd -1 ; srli rd rd ((-imm) % XLEN)
is just as long if rd is one of the popular registers x8-x15 and can be fused if needed.
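A quick C sanity check of that claim (helper names are mine); both expressions build a mask of imm ones for 0 < imm < XLEN:
uint_xlen_t mask_sloi(unsigned imm) { return ((uint_xlen_t)1 << imm) - 1; }           /* sloi rd zero imm */
uint_xlen_t mask_srli(unsigned imm) { return (uint_xlen_t)-1 >> ((-imm) % XLEN); }    /* li rd -1 ; srli rd rd (-imm)%XLEN */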
Page 10 section Bit Permutation Instructions
Page 10 section 2.2.1 rol/ror
Please add:
For the rotation instruction we consider the bit pattern in a register as being indexed by a number in the integers mod LEN (and do sign extension for LEN < XLEN), and “shift” the bit pattern in rs1 by offsetting the indices considered as an integer mod LEN with rs2 mod LEN.
The main uses of the rotation instructions are block ciphers, hashing and ??????.
uint_xlen_t ror(uint_xlen_t rs1 , uint_xlen_t rs2)
{
//FOREACH(i, 0, LEN)
// rd[i] = rs1[(i + rs2) % LEN]
uint_xlen_t rd = (rs1 >> (rs2 % LEN)) | (rs1 << ((-rs2) % LEN));
return sext(rd, LEN);
}
uint_xlen_t rol(uint_xlen_t rs1, uint_xlen_t rs2)
{
return ror(rs1, -rs2);
}
For the W version LEN = 32 and the W versions of the shifts are used.
Comment:
While ROL and ROR are the easiest permutations to explain, they are also the most easily done with just shifts and ORs. The butterfly network is also not really simpler than that for GREV, i.e. slightly expensive. With the SLLNEG/SRLNEG below and, more importantly, with the immediate versions, it is just 3 instructions. MIPS doesn't have ROL and ROR instructions. Are ROL and ROR actually worth it?
Proposal:
ROL is just ROR with rs2 negated, i.e. "ROL == RORNEG" (or equivalently ROR == ROLNEG). Therefore, one of ROL and ROR can be considered overkill, especially since the immediate version will be much more common than the register version. Alternatively, if ROL is considered useful, the corresponding NEG versions of the regular shifts should be even more useful.
Define SLLNEG, SRANEG and SRLNEG instructions
sllneg rd rs1 rs2 # rd = rs1 << ((-rs2) % XLEN)
srlneg rd rs1 rs2 # rd = rs1 >>_u ((-rs2) % XLEN)
sraneg rd rs1 rs2 # rd = rs1 >>_s ((-rs2) % XLEN)
sllnegw rd rs1 rs2 # rd = rs1 <<_w ((-rs2) % 32)
srlnegw rd rs1 rs2 # rd = rs1 >>_wu ((-rs2) % 32)
sranegw rd rs1 rs2 # rd = rs1 >>_ws ((-rs2) % 32)
This shaves off one instruction when doing ROL/ROR or FSL/FSR with shifts and OR, and likewise shaves off two instructions in the bit extraction and insertion pattern for the register version.
The instructions can use the existing negator on rs2, although the bit that indicates negation in the add instruction is unfortunately already taken up for arithmetic vs. logical shift for the right shifts. No immediate version is needed.
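As a C sketch (function names are mine), the register form of ROR then becomes SRL + SLLNEG + OR:
uint_xlen_t sllneg(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return rs1 << ((-rs2) % XLEN);
}
uint_xlen_t ror_via_sllneg(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return (rs1 >> (rs2 % XLEN)) | sllneg(rs1, rs2);   /* SRL ; SLLNEG ; OR */
}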
Page 11 section 2.2.2
Please add:
For the GREV instruction we consider the bit pattern in a register as being indexed by LOG2(LEN) bits (and do sign extension for LEN < XLEN), and "shift" the bit pattern in rs1 by XOR-ing the index bits, considered as LOG2(LEN) bits, with the last LOG2(LEN) bits of rs2 (i.e. rs2 & (LEN - 1) == rs2 % LEN).
Please keep
The GREV instruction provides …… instruction.
Please change :
It takes a single register…. reversals occur.
Please add.
Byte order swap (rs2 == -8) is essential for networking. Bit order reversal (rs2 == -1) is needed for the fast Fourier transform and “reverse addressing”.
Please change:
s/This operation iteratively/The grev instruction can be implemented with a butterfly network that iteratively/
Page 12.
Comment:
I really don’t like the C functions of grev as a definition. The “intuitive to understand” is in fact not so intuitive. . Can we put that in an appendix??? The obviously correct (but dreadfully slow) definition is:
uint_xlen_t grev(uint_xlen_t rs1, uint_xlen_t rs2)
{
    //for(int i = 0; i < LEN; ++i)
    //    rd[i] = rs1[i ^ (rs2 & (LEN - 1))]
    uint_xlen_t rd = 0;
    int i;
    FOREACH(i, 0, LEN)
        rd |= sbget1(rs1, i ^ (rs2 & (LEN - 1))) << i;
    return sext(rd, LEN);
}
The W version has LEN = 32, the D version has LEN = 64.
Comment.
I really don’t like the explanation of the pseudo-instructions below the table. What do they actually do? Something like
In other words the prefix controls the number of bits of the grouping whose order is reversed as a group, (bits, bytes, words..) where as the suffix controls the number of bits in within which the order is reversed. Thus rev8 swaps bytes within the whole XLEN bit register, whereas rev8.w swaps bytes within each of the 32 bits word within a XLEN register.
Comment.
Grev/Grevi is probably the right thing to do. However, I want to point out that the overwhelmingly most common use case would be byte swap (i.e. imm = -8), and that moreover all grev's can be reduced to grev's of a negative power of 2, just as they can be reduced to grev's of a positive power of 2, i.e. the butterfly stages. This is because
grev(rs1, k) == grev(rs1, k % XLEN),
grev(rs1, k ^ l) == grev(grev(rs1, k), l),
grev(rs1, 0) == rs1,
grev(rs1, (-a)&k) == a? grev(rs1, k) : rs1 for a = 0,1,
Finally, every number 0 <= k <= XLEN - 1 can be written as
k = ((-a_0) & (-1)) ^ ((-a_1) & (-2)) ^ ((-a_2) & (-4)) ^ ... ^ ((-a_{LOG2(XLEN)-1}) & (-XLEN/2))
with the single bits a_0, a_1, a_2, .. a_{LOG2(XLEN) -1} defined uniquely by the identities
sbget1(k, 0) == a_0, sbget1(k, i) ^ sbget1(k, i-1) == a_i , i > 0.
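A quick C sanity check of this decomposition (the code is mine; LOG2 as used elsewhere in this review): recover the a_i from k and XOR the selected negative powers of two back together; the result agrees with k mod XLEN.
uint_xlen_t grev_stages(uint_xlen_t k)
{
    uint_xlen_t r = 0;
    for (int i = 0; i < LOG2(XLEN); ++i) {
        unsigned a_i = (i == 0) ? (k & 1) : (((k >> i) ^ (k >> (i - 1))) & 1);
        if (a_i)
            r ^= (uint_xlen_t)0 - ((uint_xlen_t)1 << i);   /* the negative power of two -(1 << i) */
    }
    return r % XLEN;   /* == k % XLEN */
}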
Therefore there is the possibility to have the grev functionality with instructions corresponding to a single stage of a slightly different butterfly network, but with the freedom of passing a mask to select which bits are swapped, at the cost of having to explicitly compose a sequence of at most LOG2(XLEN) instructions if the full functionality of grev is to be used:
rev1 rd rs1 rs2 # rd = (grevi(rs1, -1 ) &~ rs2) | (rs1 & rs2) (implements brev rd rs1 : rev1 rd rs1 zero)
rev2 rd rs1 rs2 # rd = (grevi(rs1, -2 ) &~ rs2) | (rs1 & rs2)
rev4 rd rs1 rs2 # rd = (grevi(rs1, -4 ) &~ rs2) | (rs1 & rs2)
rev8 rd rs1 rs2 # rd = (grevi(rs1, -8 ) &~ rs2) | (rs1 & rs2) (implements bswap rd rs1 : rev8 rd rs1 zero)
rev16 rd rs1 rs2 # rd = (grevi(rs1, -16) &~ rs2) | (rs1 & rs2)
rev32 rd rs1 rs2 # rd = (grevi(rs1, -32) &~ rs2) | (rs1 & rs2) //(RV64, RV128)
rev64 rd rs1 rs2 # rd = (grevi(rs1, -64) &~ rs2) | (rs1 & rs2) //(RV128)
P9 Generalised shuffle and unshuffle.
Comment:
I must admit, I neither understand the description nor the rationale for these instructions, other than that they allow arbitrary bit permutations in LOG2(XLEN) ops. Nor do I understand why the zip versions are especially important. Not knowing the shuffle operations is, I am sure, a lack of education on my part, but a clearer explanation would help: (a clearer definition? is there an obvious algorithm to implement an arbitrary permutation? Is this optimal? Reference?)
Comment:
From the description in Hacker's Delight, I get that the important cases are transposes of bit matrices. I do understand matrix transpose, and e.g. making the even bits the first bits and the odd bits the last bits! You have a matrix transpose later, which the following would generalise.
Proposal:
/* Consider a register as a matrix of XLEN >> rs2 rows of length 1 << rs2 and transpose it, i.e. the rs2 least significant bits of the index become the rs2 most significant bits, and the LOG2(XLEN) - rs2 most significant bits become the LOG2(XLEN) - rs2 least significant bits of the index.
The immediate version has a 3 bit immediate, and a register version may not be needed.
*/
uint_xlen_t transp(uint_xlen_t rs1, uint_xlen_t rs2)
{
    //for(int i = 0; i < LEN; ++i)
    //    rd[i] = rs1[((i >> rs2) | (i << (LOG2(LEN) - rs2))) % LEN];
    uint_xlen_t rd = 0;
    int i;
    FOREACH(i, 0, LEN)
        rd |= sbget1(rs1, ((i >> rs2) | (i << (LOG2(LEN) - rs2))) % LEN) << i;
    return sext(rd, LEN);
}
For the W version LEN = 32, for the D version LEN = 64 (I am not sure a W or D version is needed, however).
Page 17.
I don’t understand the notation in table 2.2
Page 17 — 19
I don’t like the definition of SHFL instruction as a C function here and the pattern for RV128 is definitely not so clear if you cannot visualise the numbers as bit patterns (although the picture helps). Can we please have a (possibly slow) more conceptual bitwise definition?
Page 20
What is meant with “bitwise prefix sum of the bits in the lower byte of a 32 bit word on RV32” here?
Page 21
While clever, where do these (types of) problems come from that they merit an instruction?
Page 22.
Section 2.3 bext /bdep
Comment:
I like these instructions because they are clean and clear. However, it would still be good to have example applications, e.g. ones that use the x86 BMI instructions. The example seems a little artificial.
Since you already say that they often are not faster than software emulation, and one of the first hits you get for PEXT is that AMD implements them horribly slowly, maybe they should not be in B but could still be defined as a Z extension.
The following C functions for BEXT/BDEP are, I think, easier to understand and faster as software implementations (but still far from optimal)
uint_xlen_t bext(uint_xlen_t rs1, uint_xlen_t rs2)
{
uint_xlen_t rd = 0;
uint_xlen_t dbit, bit;
//for each bit in the source mask,
for(dbit = 1; (bit = rs2 &~ (rs2 -1)); dbit <<=1, rs2&= ~bit )
rd |= (bit & rs1 )? dbit : 0;
return rd;
}
uint_xlen_t bdep(uint_xlen_t rs1, uint_xlen_t rs2)
{
uint_xlen_t rd = 0;
uint_xlen_t sbit, bit;
//for each bit in the destination mask
for(sbit = 1; (bit = rs2 &~ (rs2 -1)); sbit <<=1, rs2&= ~bit )
rd |= (sbit & rs1)? bit : 0;
return rd;
}
Page 23 CLMUL/CLMULH/CLMULR
Please add
The CLMUL and CLMULH/CLMULR instructions implement polynomial multiplication over the Galois field GF(2), which is closely related to multiplication in the larger Galois fields GF(2^n). They are used extensively in cryptography and can be used to compute CRCs, which are defined by computations in Galois fields.
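For reference, the usual slow bit-by-bit definition of the carry-less/polynomial product (a sketch; as far as I can tell this is what the spec's clmul computes):
uint_xlen_t clmul(uint_xlen_t rs1, uint_xlen_t rs2)
{
    uint_xlen_t rd = 0;
    for (int i = 0; i < XLEN; ++i)
        if ((rs2 >> i) & 1)
            rd ^= rs1 << i;   /* XOR instead of ADD: no carries between columns */
    return rd;
}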
Bikeshed :
Carry-less multiplication is the terminology used by Intel, but that makes the mathematician in me weep. This is polynomial multiplication, for the love of $DEITY! Can we please call these instructions POLMUL[HR]?
Page 24
S/GCM/GCD/GCM (greatest common divisor/multiple) of binary polynomials/
What is the prefix XOR operation?
Page 26
Comment
Isn’t a CLMUL instruction easier and cheaper than a MULT instruction, so may be implemented even if M is not implemented? ( I may be wrong here and obviously a MULT instruction has more general benefit).
Page 26 section 2.6 bmatxor, Bator, bmatflip
I like the binary metrix operations. However, maybe it is enough to just define the more primitive multiplication between a matrix and a vector. We can then also have different sized (power of two by power of two) sized matrices multiplied to vectors.
Proposal:
orred rd rs1 3imm # take the OR reduction of groups of 1 << imm bits
# rd = 0;
# FOREACH(i, 0, XLEN)
#     rd[i >> imm] |= rs1[i];
xorred rd rs1 3imm # take the XOR reduction of groups of 1 << imm bits
# rd = 0;
# FOREACH(i, 0, XLEN)
#     rd[i >> imm] ^= rs1[i];
repeat rd rs1 3imm # repeat the first 1 << imm bits
# rd = 0;
# FOREACH(i, 0, XLEN)
#     rd[i] |= rs1[i % (1 << imm)];
transp rd rs1 3imm # transpose a bit matrix of 1 << imm columns and XLEN >> imm rows
# rd = 0;
# FOREACH(i, 0, XLEN)
#     rd[i] |= rs1[((i << imm) | (i >> (LOG2(XLEN) - imm))) % XLEN];
These can then be used to compute a matrix vector multiplication using REPEAT, AND and XORRED.
uint_xlen_t matvec(uint_xlen_t rs1, uint_xlen_t rs2, unsigned log2_size)
{
uint_xlen_t v = repeat(rs2, log2_size);
uint_xlen_t rs1_v = rs1 & v;
return xorred(rs1_v, log2_size);
}
However these instructions are also useful for computing parity, creating masks, or (in the case of ORRED) finding the terminating zero byte of a string read in XLEN/8 bytes at a time.
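As an illustration (helper names are mine, building on the ctz and sbget1 functions above), a bit-by-bit model of the proposed orred and its use for locating a terminating zero byte:
uint_xlen_t orred(uint_xlen_t rs1, unsigned imm)
{
    uint_xlen_t rd = 0;
    int i;
    FOREACH(i, 0, XLEN)
        rd |= sbget1(rs1, i) << (i >> imm);   /* rd[i >> imm] |= rs1[i] */
    return rd;
}
/* index of the first zero byte in a word, or XLEN/8 if there is none */
unsigned first_zero_byte(uint_xlen_t word)
{
    uint_xlen_t nonzero = orred(word, 3);   /* bit i == (byte i of word != 0) */
    uint_xlen_t zeros = ~nonzero & (((uint_xlen_t)1 << (XLEN / 8)) - 1);
    return zeros ? (unsigned)ctz(zeros) : XLEN / 8;
}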
Page 29. Section 2.7.1
Comment:
Ternary operators get pushback because they are a burden on the microarchitecture, and seem like the perfect candidates to replace with a fusable instruction sequence that can be recommended to compiler people, provided this is feasible.
In this case two such sequences exist: since
(rs1&rs2) | ((~rs1) & rs3) == (rs1 & rs2) ^ ((rs1 ^ (-1)) & rs3)
== (rs1&rs2) ^ (rs1 & rs3) ^ rs3
== (rs1 & (rs2 ^ rs3)) ^ rs3
We can write
CMIX rd rs1 rs2 rs3 # rs1&rs2 | (~rs1) & rs3, assume rd != rs1, rs3
{
XOR rd rs2 rs3
AND rd rd rs1
XOR rd rd rs3
}
and
CMIX rd rs1 rs2 rs3 # rs1&rs2 | (~rs1) & rs3, assume rd != rs1, rs2
{
XOR rd rs2 rs3
ANDN rd rd rs1
XOR rd rd rs2
}
which gives some freedom for the compiler writer.
Page 29 2.7.2
Comment:
We have the same issue with the ternary operator here of course.
If we have the AND.NE instruction we can proceed as for the mix operator and define the fusable sequences
CMOV rd rs1 rs2 rs3 # rs1 ? rs2 : rs3, assume rd != rs1, rs3
{
XOR rd rs2 rs3
AND.NE rd rs1 rd
XOR rd rd rs3
}
With a similar alternative sequence with AND.EQ with constraints rd != rs1, rs2
Or we can just define the sequences
CMOV rd rs1 rs2 rs3 # rs1 ? rs2 : rs3, assume rd != rs1, rs3
{
C.MV rd rs2
C.JNEZ rs1 +2
C.MV rd rs3
}
with an alternative sequence with JEQZ.
Note that SiFive is already optimising a JNEZ/JEQZ over +2 bytes followed by a C.MV (or another simple arithmetic instruction) into a conditional instruction.
Proposal:
An important special case of the three-operand conditional move (a ? b : c) is the two-operand Elvis operator a ?: c, which is (!!a) ? a : c. (According to Wikipedia, ?: is called the Elvis operator because it was/is an emoticon for Elvis Presley.) Instead of a CMOV, define an "ELVIS" instruction:
elvis rd rs1 rs2 # rd = rs1 ?: rs2 == (!!rs1) ? rs1 : rs2
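In C the proposed instruction is simply (a sketch):
uint_xlen_t elvis(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return rs1 ? rs1 : rs2;
}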
Page 30
Does changing the order of the registers in the assembler compared to the encoding not create more confusion than it solves (I genuinely don’t know).
Please add:
For use of the FSL and FSR instruction see section 3.
Spelling out the C function as
uint_xlen_t fsr(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
    uint_xlen_t rd, tmp;
    //test whether rs2 % (2*XLEN) >= XLEN
    tmp = rs2 & ((uint_xlen_t)1 << LOG2(XLEN)); //ANDI
    if (tmp) { //JNEZ
        rd = rs3 << (rs2 % XLEN); //SLL
        tmp = rs1 >> ((-rs2) % XLEN); //SRLNEG
        rd |= tmp; //OR
    } else {
        rd = rs1 << (rs2 % XLEN); //SLL
        tmp = rs3 >> ((-rs2) % XLEN); //SRLNEG
        rd |= tmp; //OR
    }
    return rd;
}
uint_xlen_t fsl(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
return fsr(rs1, -rs2, rs3);
}
shows that the immediate versions (where the compiler can do the test) are just three instructions
(assuming an SRLNEG instruction exists, otherwise it is four) and a temporary register clobber, while the non-immediate version is just 5 instructions at runtime and 8 static.
Also, FSL is FSRNEG, i.e. FSL and FSR are essentially equivalent instructions which take up lots of encoding space, so two ternary instructions may be a bit of overkill.
Page 31 Section 2.8 Unsigned address calculation
Comment: I like these instructions. Since adds are so dominant in usage, and in particular in address calculations in most code, they may well turn out to be the most used instructions of the B extension.
What does ADDIWU do? Does it zero or sign extend its 12 bits immediate? (I would suggest zero extending is more useful as it increases the range of constants that can be defined in one instruction)
The i^j example is not terribly good, as i^j happens to be already zero extended if i and j are properly sign extended. However, the seemingly innocuous expressions p[i] or p + i for i unsigned already involve a zero extension of i (if p is a char*) or a zero extension and a shift.
Please add:
The SLLIU.W instruction is used to compute addresses of array elements indexed by unsigned integers.
Comment:
Is there no SLLU.W (aka SLL.WU) instruction?
Comment:
It might be slightly better to have rs1 zero extended instead of rs2. That leaves open the possibility to have a slightly more regular C instruction
C.ZEXT rd -> ADD.WU rd rd zero
Bikeshed
I think these instructions should be called ADD.WU, since one adds a zero-extended word to an XLEN-sized value to obtain an XLEN-sized value, and SLLI.WU, since one shifts a zero-extended word to get an XLEN-sized value.
Proposal:
I think the scheme could and should be extended to bytes and halves. In C the sum of two shorts is actually an int (rather than a short) so extending the scheme to bytes and halves covers software that uses (for better or for worse) shorts as an integer and avoids nasty surprises. So I propose to have for example
ADD.H rd rs1 rs2 # SEXT.H rd rs2; ADD rd rs1 rd
SUB.HU rd rs1 rs2 # ZEXT.HU rd rs2; SUB rd rs1 rd
Where of course SEXT.H and ZEXT.HU don't actually exist as instructions; in fact they can be made pseudo-instructions using ADD.H and ADD.HU with rs1 == zero.
The logic of the W instructions then also gives instructions
ADDW.H rd rs1 rs2 # ADD.H rd rs1 rs2 ; SEXT.W rd rd
This should be complemented with
ADDWU.HU rd rs1 rs2 # ADD.HU rd rs1 rs2; ZEXT.WU rd rd
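In C terms (a sketch using the sext/zext helpers from the introduction; names are mine, not spec text):
uint_xlen_t add_h (uint_xlen_t rs1, uint_xlen_t rs2) { return rs1 + (uint_xlen_t)sext(rs2, 16); }
uint_xlen_t add_hu(uint_xlen_t rs1, uint_xlen_t rs2) { return rs1 + zext(rs2, 16); }
uint_xlen_t addw_h(uint_xlen_t rs1, uint_xlen_t rs2) { return (uint_xlen_t)sext(add_h(rs1, rs2), 32); }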
Likewise I think there is value in having a pointer + integer instruction, i.e. an instruction that absorbs the shift like LEA on x86 (one of the most commonly used arithmetic instructions on x86, although that is a bit misleading because it is also the 2-operand add on x86). People have been touting fusion of
SLLI rd rs2 imm
ADD rd rs1 rd
but note that one cannot use the RVC version of SLLI unless rd == rs2 (the register containing the index) and for basic implementations an “already fused instruction”
ADDSL<k> rd rs1 rs2 # SLLI rd rs2 <k> ; ADD rd rs1 rd
actually seems easier to implement than fusion.
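A sketch of the semantics in C (the shift amount k would be a 3-bit field of the opcode; code is mine):
uint_xlen_t addsl(uint_xlen_t rs1, uint_xlen_t rs2, unsigned k)
{
    return rs1 + (rs2 << k);   /* rd = rs1 + (rs2 << k), k = 1..7 */
}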
Together that gives the following add instructions :
#add #addw #addd
addsl1 addwsl1 adddsl1
addsl2 addwsl2 adddsl2
addsl3 addwsl3 adddsl3
addsl4 addwsl4 adddsl4
addsl5 addwsl5 adddsl5
addsl6 addwsl6 adddsl6
addsl7 addwsl7 adddsl7
add.b addw.b addd.b
add.h addw.h addd.h
add.w addd.w
add.d
add.bu addwu.bu adddu.bu
add.hu addwu.hu adddu.hu
add.wu addwu adddu.wu
add.du adddu
With the pattern repeating for sub.
While these are a lot of different ADD’s, the instructions form a densely encoded and regular
5 bit family of instructions (add/sub, XLEN/sized, (signed/unsiged, 2size) or 3shamt), and they all correspond directly to some form of add between basic datatypes in the C language.
For RV32 there are 2*(7 + 2 + 2) == 22 new add/sub instructions. Moreover, adds are dominant in ALU use.
The additional SUB versions seem much less important than the ADD versions and may perhaps be skipped. However, since SUB itself can obviously not be skipped, regularity of encoding suggests having them anyway.
Note that an ADDSL<k> instruction reduces the need for a SLLI.WU instruction: for 0 < imm < 8
SLLI.WU rd rs2 imm
C.ADD rd rs1 rd
is equivalent to (but 2 bytes shorter than)
ZEXT.WU rd rs2
ADDSL<imm> rd rs1 rd
Here I assumed that rs2 gets shifted and zero or sign extended. What is best for a potential C.ADDSL2/3 instruction probably all depends on fusing patterns.
Also SUB is not symmetric, so there is a real difference between
(V1) SUBSL<k> rd rs1 rs2 # rd = rs1 - (rs2 << k)
and
(V2) SUBSL<k> rd rs1 rs2 # rd = (rs1 << k) - rs2
(But then again, quite possibly all the different variants of SUB are overkill.)
Page 32, Section 2.10 Future compressed instructions
I would take this section out, or just suggest that C.NOT, C.NEG, C.ZEXT.W and C.ZEXT.D may be worthwhile additions. More quantitative data is needed, however, and this cannot be done just for the B extension. For example, I think another instruction that may be worthwhile is
C.LPC rd -> AUIPC rd 0 # load program counter
to make PC-relative addressing (e.g. for loading constants!) more compact, and this would fit very nicely in the C.LUI rd 0 slot. Personally I doubt that a C.NOT instruction is worth it, and given the choice, a C.NEG instruction is probably more generally useful than C.NOT. (Random case in point: in these random x86 instruction usage statistics https://www.strchr.com/x86_machine_code_statistics NOT does not even make it to the list of mentioned instructions.) C.ZEXT.WU may be worth it if people don't want the WU ADDs.
Page 33
Encodings are a bitch. Perhaps MIN[U] and MAX[U] should be placed with SLT and SLTU and
ADDU.W and SUBU.W (aka ADD.WU and SUB.WU) should be placed with ADD rather than ADDW
Page 38 Section 2.11 Future 64 bit instructions for bitfield extract and place
I would take this section out.
Page 39 Section 2.12 Micro architectural considerations and macro-op fusion for bit manipulation
Page 40 Fused load-immediate sequences
Comment.
Clifford rightly started a discussion about long immediates (and long JAL!); maybe mentioning that this is being discussed would be useful!
Comment
If PACK is changed to have PACK.BU, PACK.HU PACK.WU and PACK.DU, the PACK.WU version should be used in the examples.
Page 41:
S/32-bit numbers that have their MSB set/zero extended 32-bit numbers/
(Ie. they don’t have there MSB set but _unset_!)
Comment
If it is decided that ADDIWU should have a zero-extended 12-bit immediate, mention that in the "And finally ..." paragraph.
Page 41 Section 2.12.3 Fused *-not sequences
Is postfix NOT still very important with ANDN, ORN, XNOR, NAND and NOR instructions?
Page 42 Section 2.12.4
I think the extraction sequence should be
slli rd rs1 (-(pos + len)) % XLEN
srli rd rd (-len) % XLEN
this always makes sense, and does "the right thing" for 0 <= pos + len <= XLEN, pos >= 0, and
len >= 0, except pos == len == 0 :-(. It also shows the use of SLLNEG and SRLNEG for the non-immediate case.
I think the insertion sequence of a bit field of length len starting at bit 0 is wrong and should be
slli rd rs1 (-len) % XLEN
srli rd rd (-(len + pos)) % XLEN
This also does “the right thing” except for pos == len == 0
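Spelled out as C (a sketch, assuming 0 < len, 0 <= pos and pos + len <= XLEN; helper names are mine):
uint_xlen_t bfextu(uint_xlen_t rs1, unsigned pos, unsigned len)   /* extract len bits at pos */
{
    return (rs1 << ((-(pos + len)) % XLEN)) >> ((-len) % XLEN);
}
uint_xlen_t bfplace(uint_xlen_t rs1, unsigned pos, unsigned len)  /* place low len bits of rs1 at pos */
{
    return (rs1 << ((-len) % XLEN)) >> ((-(len + pos)) % XLEN);
}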
Which other sequences with postfix srli and slli have applications?
Comment:
The elephant in the room is to what extent to standardise whether compilers should recognise recommended fused sequences, which sequences to recommend, and to what extent compiler writers are supposed to respect these sequences.
Page 42 section 2.12.6 Fused ternary operations
I would add this as a general remark in the previous section that all arithmetic ops of the form
Alu-op1 rd rs1 rs2
Alu-op2 rd rd rs3 (rd != rs3)
(And for that matter
Alu-op1 rd rs1 rs2
Alu-op2 rd rs3 rd )
are amenable to fusing to an alu-op12 rd rs1 rs2 rs3, and that this is the preferred way of dealing with 3-input-register ALU ops if possible. Note the rd != rs3 restriction. This extends to sequences of three or more ALU ops, where even more restrictions may be needed (see e.g. the sequences for CMOV and CMIX). Should such "designed to be fused" sequences be preceded by a hint?
Page 43.
The XNOR macro is already defined as an instruction with the same semantics.
Comment:
I proposed that NAND and NOR should also be instructions with the same semantics. See the logical instructions section.
The BFMAK macro definition seems wrong and I think it should be simply
bfmak rd, len, pos -> sloi rd, zero, len ; c.slli rd rd pos
However, it seems to me that it is better to simply use
li rd ((1 << len) -1) << pos
and teach the compiler/loader about sloi as one of its “myriad of sequences” to generate a constant.
Bikeshed:
If you want to keep BFMAK, can we call it BFMAKE or LBFI (for load bit field immediate)?
Please use ((-len-pos)%XLEN) instead of (XLEN - len -pos) and similarly for other shift amounts.
Bikeshed:
Maybe beginpos and (one past the) endpos as parameters for the BFEXT[U] macros are better than len, pos. This should make no difference for code generation but avoids the confusion whether it is len, pos or pos, len.
Page 46: section 3.1.3
Select could be simplified with sbset
Select:
sbset a1, zero, a1
bdep a0, a1, a0
ctz a0, a0
ret
(Clever trick by the way)
Rank could be done with one less instruction with SLLNEG if it counts the number up to a1 (but that uses modular arithmetic mod XLEN and counts _all_ the bits if a1 = 0)
Rank:
sllneg a0, a0, a1
pcnt a0 a0
ret
Page 46:
Can one use GREV instead of SHFL here?
One should really not have to use SHFL to pack bytes in a word. With PACK.BU and PACK.HU it becomes the straightforward
pack.bu a0 a0 a1 # pack least significant half word
pack.bu a2 a2 a3 # pack most significant half word
pack.hu a0 a0 a2 # pack word
Page 47
Bikeshed:
Please write
x = x >> (t % 64)
Even though, of course, &63 does the same as %64.
For a good comparison, it would be better to show that the (x-1) in x |~(x -1) (== x | (-x) ) requires clobbering a register.
Another 4-instruction implementation not clobbering an extra register (but with a jump) is:
uint64_t rfill_clzsro(uint64_t x)
{
if (x){ //JEQZ
x = clz(x); //CLZ
x = sro(0, x % 64); //SRO
x = ~x; //NOT
}
return x;
}
Page 48: 3.1.6 Round to next power of two
The number 0 is not a power of two, but this may be conventional use anyway.
This example would also work with SLLNEG instead of ROR
Page 50:
Comment:
If I count correctly, with SLLNEG and AND.NE but no FSR and CMOV, the parse_27bit example would be equally long, because AND.NE is CMOV with rs3 == zero and the FSR expands into 3 instructions, one of which is the SLL below the FSR and is already computed (I think).
Page 51: Using butterfly operations.
At this stage the reader might need to be reminded of the argument conventions of CMIX, i.e. a comment like
cmix a0, a1, a2, a0 # a0 = (a2 & a1) | (a0 &~ a1)
Page 52
The butterfly function is undefined and seems to depend on parameters other than just the stage number.
Page 53 Using baseline Networks
Again, butterfly(0) seems underspecified, and this section, or how it relates to the instruction set, is not clear.
Page 56. 3.4.3 Explanation
Working this text into the corresponding instruction definition sections would be very helpful. For example you use the notation introduced in this section without explanation in the table of standard operations in the GZIP section.
IMO the explanation of what GZIP does is still suboptimal however.
Page 60
I think you mean remainder instead of the (German) rest.
Page 61
These decode examples could probably use a C code version as a companion. How did you construct them?
As an OT side remark: to use an instruction decoder to emulate an instruction in M mode, even the obvious improvements provided by the B extension are probably not enough to make software emulation practical if the instruction has to be parsed from the beginning. It would seem much easier if upon a trap the processor parsed the instruction format, for which it has lots of hardware, and provided CSRs with:
0 trapping address (this the CPU already does)
1 instruction opcode including the full function fields
2 TRAP_IMPL address of an implementation function for this opcode or an opcode_unimplemented function
3 TRAP_RS1 content of rs1
4 TRAP_RS2 content of rs2 or immediate
5 TRAP_RS3 content of rs3 if it exists
The trap could then do something like
CSRRW a0 TRAP_RS1 a0 # exchange a0 and TRAP_RS1; saves a0
CSRRW a1 TRAP_RS2 a1 # exchange a1 and TRAP_RS2; saves a1
CSRRW ra TRAP_IMPL ra # exchange ra and TRAP_IMPL; saves ra
JALR ra ra
CSRRW ra TRAP_IMPL ra # exchange ra and TRAP_IMPL; restores ra
CSRRW a1 TRAP_RS2 a1 # exchange a1 and TRAP_RS2; restores a1
CSRRW a0 TRAP_RS1 a0 # exchange a0 and TRAP_RS1; restores a0
MRET
Obviously this is naive and more song and dance is needed at the trap boundary, but the point is not to do the parsing of the instruction "by hand". Also, this is obviously outside the scope of the B extension.
==== END OF REVIEW====
Ok so read everything, must re-read several times. First impressions:
Whilst ROR etc seem like a good idea for cryptographic applications, it was when I saw GF(2) MUL as well that I thought, "hang on a minute, the op count is getting alarmingly high. GFMUL is not exactly BitManip, it's a specialised function".
I would strongly suggest that anything not general purpose (with at least two common uses in completely different applications fields) NOT go into BitManip but into their own extension. Not even as a Zbitmanip*: out of BitManip entirely.
Otherwise BitManip becomes something of a "dumping ground" aka kitchen sink.
That risks increasing the workload for BitManip and, not only that, risks overloading OP32 and RVC, making further extensions extremely challenging.
Yes there is ISAMUX/ISANS however it is not to be abused, being a safety valve and/or applications specific enabler, not a general purpose OP-extender.
Also, ternary operators - three src operands and 1 dest - are incredibly expensive. The only way to encode four registers (4x5=20bits) is to take an ENTIRE major OP32.
I think there is only one 4 reg instruction in RV : FMAC.
RISC is not about putting in as much as possible, it is about doing as little as possible in the ISA and still being incredibly efficient and effective.
Ternary operators completely break that rule. Red flag.
The long immediates idea: separate discussion, reminder that for completeness a proper comparison is needed (using the existing instruction lengths)
Also, the Mill Architecture has "compressed" immediates. 0 is encoded in a single bit. -1 may be encoded in 9 bits.
My feeling is that any comparison of long immediate proposals should also include compression algorithms (simple ones, of the order of Huffman encoding).
For another thread, that one.
L.
On Mon, Sep 2, 2019, 13:55 Rogier Brussee <rogier...@gmail.com> wrote:
* If the C code definitions involving shifts I prefer to write something like rs1 << (rs2 % LEN) because the shifts in RV are mod LEN. wherever a shift is made in C code, to emphasise that shifts in RV are mod XLEN, 32 (for W) or 64 (for D). This also makes it slightly clearer that it makes sense to shift over negative numbers.
As a result I avoid definitions like shamt = rs2 & (LEN - 1) which, of course, is equivalent but conceptually less clear.
For the most common C implementations, a % b for negative a returns negative integers (see https://gcc.godbolt.org/z/r72dnQ ) which are undefined behavior as a shift amount according to most compilers (and the C standard) (for example, if the compiler compiled directly to 32-bit ARM assembly without checking for undefined behavior, 1 << (-1 % 16) returns 0 on ARM, rather than 0x10000, since -1 % 16 == -1 and shifts on ARM reduce the shift amount mod 0x100 (rather than mod 32) producing a shift by 255). % is generally called the remainder operation instead of mod. Using the mask allows always getting non-negative results, avoiding shifting by a negative number. maybe it would be worthwhile to define an auxillary function mod to calculate the modulo:
int mod(int a, int b) // replace types as necessary
{
    int rem = a % b;
    return (rem + b) % b;
}
it would be used in the shift operations as follows:
unsigned shift_demo(unsigned a, int b)
{
    return (a >> mod(b, LEN)) | (a << mod(-b, LEN));
}
Jacob Lifshay
On Tuesday, 3 September 2019 at 03:12:37 UTC+2, Jacob Lifshay wrote:
The equivalence of "%N" and "&(N-1)" holds for _unsigned_ integers uint_xlen_t, which do modular arithmetic mod 2^XLEN, and N a power of two (such as XLEN = 2^5 or 2^6). Negative numbers like -8 still make perfect sense of course.
Review of Xbitmanip V0.9 (June 10)
I’m assuming that means RISC-V Bitmanip v0.90. (That’s the June 10 release.)
Just a few days ago I released RISC-V Bitmanip v0.91. (Note that it’s not called XBitmanip anymore ever since we became an official RISC-V task group again.)
You can always find the latest version of the bitmanip spec (or something very close to the latest version) at
https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf
Your review is over 20 pages long, so I’m keeping my replies to what I think are the most important points.
It took me over 5 hours to work through your entire review and write the reply below.
So I hope you can understand that I don’t have the time to reply to everything you wrote.
Please open an issue at https://github.com/riscv/riscv-bitmanip/issues/new if you feel like any one suggestion I skipped or dismissed in my reply should see more debate.
I’m skipping a lot of stuff that’s imo bikeshedding (including a lot of stuff that you have not marked as bikeshedding). I’ve also skipped a lot of proposals to re-do large portions of riscv-bitmanip as it’s about a year too late for that.
For many sections it is not quite clear what the W versions (for RV64) or D versions (for RV128) are supposed to do. I will try to indicate how I understand it works.
The general semantic and reasoning behind *W instructions is explained in the base ISA spec. They always just perform the RV32 operation on RV64. I don’t see it in the scope of the Bitmanip spec to explain *W instructions as a concept.
It would be useful if a rationale for the use of an instruction would be given and quantitative data on use were available.
We are working on that. Right now we are building the tools. Can't benchmark anything if you can't generate the code.
Many sections contain C code as a way to define what the instruction is supposed to be doing. I like this because it is usually much easier to follow than words. However, I think there is value in making the definition as conceptually simple as possible even if it is (much) slower, and I will suggest some. Can we move faster but less intuitive implementations to an appendix?
These ARE the simple versions. The fast versions can be found in the “Fast C reference implementations” section within the “Reference Implementations” chapter.
* If the order of a loop is not important I use this macro
/* for loop that does not care about order: fake it by running backwards */
#define FOREACH( I , BEGIN, END) for( I = (END) - 1 ; I >= (BEGIN) ; —I)
You can’t be serious. I think we just have very different ideas of what “simple” C code is. The reference implementations won’t be changed to use stuff like that.
Comment: It seems to me that the B extension has moved to “integer instructions that extend or optimise the basic I instructions. I think this is a good thing, and I would argue that the B extension should position as slightly cheaper than the M extension, [...]
That’s what the Zbb extension is.
For the W versions set LEN = 32.
I’m not adding that to the description of every instruction. I’ve now added one paragraph to the introduction that reiterates what *W instructions do.
Please add
Clz and ctz are used for finding the first elements of a bit map. In particular it is used for the implementation of malloc. Clz is also used for soft float
Tbh I don’t think someone reading this document would need to be educated about the usefulness of clz/ctz, but I’ve added a few more applications to the clz/ctz section now.
Comment: I like that all basic logical operations have a “n” version.
Is nand and nor not also worth it (the RVV mask instructions seem to have them for example) ?
Because of instructions like SUB there is an existing input inverter that can be used for ANDN, ORN, and XNOR, controlled by the same instruction bit that is used in those new instructions. The cost of adding ANDN, ORN, and XNOR is virtually zero. The cost of adding NAND/NOR would not be.
please add
The pack instruction can be used to pack “struct” data that fits in a XLEN sized register and is passed in registers.
I’ve added language to that effect.
Proposal:
I really like the pack instructions. However I would propose to have 4 and XLEN independent pack instructions:
pack.bu rd rs1 rs2 # rd = zext(rs1, 8) | ( zext(rs2, 8) << 8 )
pack.hu rd rs1 rs2 # rd = zext(rs1, 16) | (zext(rs2, 16) << 16)
pack.wu rd rs1 rs2 # rd = zext(rs1, 32) | (zext(rs2, 32) << 32). //(RV64)
pack.du. rd rs1 rs2 # rd = zext(rs1,64) | (zext(rs2, 64) << 64) //(RV64, 128)
The only really new instruction in that list is pack.bu, and it replaces only two instructions (pack+shuffle).
Your pack.hu is just packw (RV64) or pack (RV32), except that packw fills the upper 32-bit with sign extension.
Your pack.wu is just pack on RV64 (or packd on RV128).
Your pack.du presumably is just packw on RV128.
Page 7.
Comment.
One should not need to use shfl instructions to pack bytes. See proposal for pack.bu.
One should not need a pack.bu instruction to pack bytes. See shfl instructions. ;)
The ZEXT pseudo instruction could also use ADDWU. Also this is (as far as I can see) the only mention of RV128. It should perhaps be mentioned everywhere (preferably) or nowhere.
There’s nothing to gain by using ADDWU instead of PACK[D] for ZEXT.W. Using PACK for all ZEXT where it’s possible is more consistent.
Any mention of RV128 in the bitmanip spec is just commentary considering that there is not even a ratified RV128 base ISA spec.
When I mention RV128 it is only to show that we have considered implications for RV128. (For example wrt encoding of shift-like instructions.)
Page 7 section 2.1.5 Min/Max
Proposal:
I would propose to also add SGE and SGEU. As is done in RI5CY/PULP, this would be nicely placed with the SLT and SLTU instructions (i.e. have the same func3 bits as SLT and SLTU see proposal in section 2.1.3 ) making each a 2 bit family
That would replace a sequence of only two easily fusable instructions. (SLT/SLTU + XORI)
A while back I disassembled the entire debian base image to figure out what percentage of instructions would be sequences matching ==, !=, ||, or &&, to estimate if it would be worth adding instructions for those. This was the result:
total 10489586
seteq 634 (0.0060%)
setne 1856 (0.0177%)
lor 353 (0.0034%)
lnor 243 (0.0023%)
I have not done a similar experiment for SGE or SGEU.
You are welcome to perform a similar experiment for SGE/SGEU and propose the instructions via github issue if it turns out they’d see a lot of usage. Here are my scripts for reference:
Page 8
Taking the absolute value of an integer is a generally useful operation and I think it is not needed to mention SAT solvers.
I’ve made the language a bit more generic. But note that even though it’s a generally useful operation, there are few applications that need it so frequently that it has measurable performance impacts. (That is of course in big part because calculating abs() isn’t very expensive even without min/max.) Therefore I think it does make sense to mention SAT solvers as a concrete application.
Page 8. Section 2.1.6 sbset/sbclr/sbinv/sbext
Please add:
These instructions efficiently support bitfields, and avoid costly constant constructions and nasty surprises especially if the bitfields are not in low bits.
In the context of bit manipulation instructions, the term “bitfield” is commonly used in the context of instructions that manipulate contiguous regions of bits. (See for example the ARM instructions BFC, BFI, SBFX, etc.)
These are single bit instructions, not bitfield instructions.
I think the s prefix should be reserved for instructions returning booleans (0/1), which means only SBEXT is properly named. I would suggest BSET1, BCLR1, BINV1, and SBEXT1 or SBGET1 (but see below)
You don’t even say why you feel that the “S” prefix should be preserved for such instructions..
Also, to late: There’s already SB SH SW SLL[I] SRL[I] SRA[i] and the SEXT.W pseudo-instruction in the base ISA.
What is meant with “bitwise prefix sum of the bits in the lower byte of a 32 bit word on RV32” here?
https://en.wikipedia.org/wiki/Prefix_sum
https://en.wikipedia.org/wiki/Bitwise_operation
The example code in question is described as following in the spec:
“For example, the following code calculates the bitwise prefix sum of the bits
in the lower byte of a 32 bit word on RV32. The final prefix sum is stored in the 8 nibbles of the output word.”
I.e. when A[i] is bit i of the input, and B[i] is nibble i of the output, then
B[0] = A[0]
B[1] = A[0] + A[1]
B[2] = A[0] + A[1] + A[2]
etc.
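A slow C sketch of the above (not the spec's code):
uint32_t bitwise_prefix_sum8(uint32_t x)
{
    uint32_t out = 0, sum = 0;
    for (int i = 0; i < 8; ++i) {
        sum += (x >> i) & 1;     /* running count of set bits A[0..i] */
        out |= sum << (4 * i);   /* store it in nibble i of the output */
    }
    return out;
}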
Since you already say that they often are not faster than software emulation, and one of the first hits you get for PEXT is that AMD implements them horribly slow, maybe they should not be in B but still be defined as Z extension.
Did you even read the introduction section of the spec chapter?
The CLMUL and CLMULH/ CLMULR instructions implement polynomial multiplication over the Galois field GF(2) and the closely related multiplication in the higher Galois fields GF(2^n). They are used extensively in cryptography and can be used to compute CRC’s which are defined by computations in Galois fields.
They don’t implement multiplication over GF(2^n). That would be carry-less product followed by reduction. So implementing GF(2^n) products is an _application_ of clmul, but it is not what clmul does. Using Barrett reduction a GF(2^n) product can be implemented with 4 clmul operations: Two clmul for the polynomial GF(2) multiplication and two more for the reduction.
I’ve added the sentence “Carry-less multiplication is equivalent to multiplication in the polynomial ring over GF(2)”.
Page 24
S/GCM/GCD/GCM (greatest common divisor/multiple) of binary polynomials/
GCM means Galois/Counter Mode here.
I’ve replaced “CRC and GCM” with “Cyclic Redundancy Check (CRC) and Galois/Counter Mode (GCM)”.
What is the prefix XOR operation?
for (i=0..XLEN-1): out[i] = ^in[i:0]
where X[i] is bit number i, X[i:0] is the i+1 LSB bits and ^X is XOR reduce.
Like Prefix-sum but with xor instead of addition.
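Or as a slow C sketch:
uint_xlen_t prefix_xor(uint_xlen_t in)
{
    uint_xlen_t out = 0, acc = 0;
    for (int i = 0; i < XLEN; ++i) {
        acc ^= (in >> i) & 1;   /* XOR reduce of bits in[i:0] */
        out |= acc << i;
    }
    return out;
}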
Isn’t a CLMUL instruction easier and cheaper than a MULT instruction, so may be implemented even if M is not implemented? ( I may be wrong here and obviously a MULT instruction has more general benefit).
It is cheaper but not so cheap that it’s likely that an implementation would bother with fast CLMUL if they don’t even bother with fast MUL.
I like the binary metrix operations. However, maybe it is enough to just define the more primitive multiplication between a matrix and a vector. We can then also have different sized (power of two by power of two) sized matrices multiplied to vectors.
The HW cost for bmat[x]or is not as big as one might expect.
My reference single-cycle implementation of bmat[x]or mapped to a generic 4-LUT FPGA architecture is 706 LUTs in size. That’s less than the size of five 64-bit adders.
Being able to for example perform an arbitrary permutation of bytes in a word in a single instruction is interesting. Doing it in 8 instructions doesn’t really add much.
Ternary operators have pushback because they are a burden on the micro architecture
That’s why they are in a separate Z extension.
Does changing the order of the registers in the assembler compared to the encoding not create more confusion than it solves (I genuinely don’t know).
Feel free to discuss at https://github.com/riscv/riscv-bitmanip/issues/11
Apparently it would be “insane” not to do it. Let me know when you two have agreed on whether changing the order is “confusing” or not changing it would be “insane”..
shows that the immediate versions (where the compiler can do the test) are just three instructions
and we had a very long discussion within the task group and ultimately decided that it would be worth including them anyway. I agree with that decision.
What does ADDIWU do? Does it zero or sign extend its 12 bits immediate? (I would suggest zero extending is more useful as it increases the range of constants that can be defined in one instruction)
The immediate is sign-extended, exactly like the immediate to ADDIW. I’ve now added a sentence clarifying that.
The i^j example is not terribly good as i^j happens to be already zero extended, if i and j are properly sign extended.
No. i^j is sign extended if i and j are sign extended. (That’s why there is no XORW instruction.)
For example: 0x80000000 ^ 0x00000000
With sign extended inputs:
0xFFFFFFFF80000000 ^ 0x0000000000000000 = 0xFFFFFFFF80000000
The SLLIU.W instruction is used to compute addresses of array elements indexed by unsigned integers.
All five “unsigned address calculation instructions” are.
Comment:
Is there no SLLU.W (aka SLL.WU) instruction?
What would we need that for?
It might be slightly better to have rs1 zero extended instead of rs2. That leaves open the possibility to have a slightly more regular C instruction
That would zero extend the wrong input for subu.w if the idea is to use it for the address calculation use-case.
I think the scheme could and should be extended to bytes and halves. [..]
Together that gives the following add instructions :
#add #addw #addd
addsl1 addwsl1 adddsl1
addsl2 addwsl2 adddsl2
addsl3 addwsl3 adddsl3
addsl4 addwsl4 adddsl4
addsl5 addwsl5 adddsl5
addsl6 addwsl6 adddsl6
addsl7 addwsl7 adddsl7
add.b addw.b addd.b
add.h addw.h addd.h
add.w addd.w
add.d
add.bu addwu.bu adddu.bu
add.hu addwu.hu adddu.hu
add.wu addwu adddu.wu
add.du adddu
With the pattern repeating for sub.
If I counted correctly that would mean adding 82 new R-type instructions. That’s over half of a minor opcode. You can’t be serious.
Page 32, Section 2.10 Future compressed instructions
I would take this section out, or just suggest that C.NOT, C.NEG, C.ZEXT.W and C.ZEXT.D may be worthwhile additions. More quantitative data is needed however, and this cannot be done just for the B extension. For example I think another instruction that may be worthwhile is
Did you read the part where it says this is a recommendation for a future revision of “C”, not part of the “B” extension?
Page 41:
S/32-bit numbers that have their MSB set/zero extended 32-bit numbers/
(I.e. they don't have their MSB set but _unset_!)
Here is a 32-bit number with its MSB set: 0x82341234
Sign extended: 0xFFFFFFFF82341234
Zero extended: 0x0000000082341234
You can’t load the zero extended version on RV64 with LUI+ADDI.
But you can load it with LUI+ADDIWU.
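A C sketch of that sequence, using <stdint.h> types and assuming ADDIWU performs the 32-bit add (with a sign-extended immediate, as stated above) and zero-extends the result:
uint64_t load_0x82341234(void)
{
    uint64_t lui = 0x82341000u;
    lui = (lui ^ 0x80000000u) - 0x80000000u;   /* LUI sign-extends on RV64: 0xFFFFFFFF82341000 */
    uint64_t rd = (uint32_t)(lui + 0x234);     /* ADDIWU: 32-bit add, result zero-extended     */
    return rd;                                 /* 0x0000000082341234                           */
}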
Page 42 Section 2.12.4
I think the extraction sequence should be
slli rd rs1 (-(pos + len)) % XLEN
srli rd rd (-len) % XLEN
I think taking the remainder of a negative number is a bad idea in many contexts, but especially in something like a specification, considering that in languages like C the behavior is implementation-defined.
this always makes sense, and does “the right thing” for 0 <= pos + len <= XLEN, pos >= 0, and len >= 0 except pos == len == 0 :-(.
I’m not even sure what you mean by “the right thing” for len=0 and pos!=0.
I think the insertion sequence of a bit field of length len starting at bit 0 is wrong and should be
slli rd rs1 (-len) % XLEN
srli rd rd (-(len + pos)) % XLEN
I fixed the issue, but kept my form that does not require modulo of negative numbers.
This also does “the right thing” except for pos == len == 0
Again, not seeing how leaving the register unchanged (left shift by zero, right shift by zero) is “the right thing”.
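For readers following the exchange, a minimal C sketch of the two-shift extraction idiom under discussion, written with unsigned arithmetic so the modulo stays well defined; it assumes len > 0 and pos + len <= XLEN, matching the caveats above:
uint_xlen_t bfext(uint_xlen_t rs1, unsigned pos, unsigned len)
{
    unsigned sl = (XLEN - pos - len) % XLEN;   /* == (-(pos+len)) % XLEN in unsigned arithmetic */
    unsigned sr = (XLEN - len) % XLEN;         /* == (-len) % XLEN                              */
    return (rs1 << sl) >> sr;                  /* slli rd, rs1, sl ; srli rd, rd, sr            */
}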
The elephant in the room is to what extent to standardise whether compilers should recognise recommended fused sequences, what sequences to recommend, and to what extent compiler writers are supposed to respect these sequences.
There is no requirement for a compiler or for a processor to implement any of those.
The BFMAK macro definition seems wrong and I think it should simply be
bfmak rd, len, pos -> sloi rd, zero, len; c.slli rd, rd, pos
That’s just an equivalent definition. The difference is that your definition does not match the *-srli/*-srai fusion pattern, but the definition in the spec does.
And the description for that list of three pseudo-ops literally says “bitfield pseudo-ops for sr[la]i postfix fusion”, so I think the sequences should actually match that fusion pattern.
Please use ((-len-pos)%XLEN) instead of (XLEN - len -pos) and similarly for other shift amounts.
I’m not going to use C-like expressions that rely on implementation defined C behavior in a spec document.
Page 46: section 3.1.3
Select could be simplified with sbset
That’s correct. That example was written before we added SB* instructions. I’ve changed it now accordingly.
Page 46:
Can one use GREV instead of SHFL here?
I don’t think so. I wouldn’t know how.
One should really not have to use SHFL to pack bytes in a word. With PACK.BU and PACK.HU it becomes the straightforward
You are saying that as if it would be obvious what’s bad about using shuffle for that. It’s not obvious to me. Why do you think we need to invent new instructions to do something that can already be done with SHFL?
Page 48: 3.1.6 Round to next power of two
The number 0 is not a power of two, but this may be conventional use anyway.
That’s why the text contains the clause “i.e. equivalent to rfill(x-1)+1.”
Page 52
The butterfly function is undefined and seems to depend on parameters other than just the stage number.
The first paragraph in that section defines what a stage-N butterfly operation is.
Page 60
I think you mean remainder instead of the (German) rest.
Oops. :D Fixed.
Page 61
These decode examples could probably use a C code version as a companion. How did you construct them?
I created them with the use of an SMT solver.
Regards,
- Clifford
Hi,
Review of Xbitmanip V0.9 (June 10)
I’m assuming that means RISC-V Bitmanip v0.90. (That’s the June 10 release.)
Just a few days ago I released RISC-V Bitmanip v0.91. (Note that it’s not called XBitmanip anymore ever since we became an official RISC-V task group again.)
You can always find the latest version of the bitmanip spec (or something very close to the latest version) at
https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf
Your review is over 20 pages long, so I’m keeping my replies to what I think are the most important points.
It took me over 5 hours to work through your entire review and write the reply below.
So I hope you can understand that I don’t have the time to reply to everything you wrote.
Please open an issue at https://github.com/riscv/riscv-bitmanip/issues/new if you feel like any one suggestion I skipped or dismissed in my reply should see more debate.
I’m skipping a lot of stuff that’s imo bikeshedding (including a lot of stuff that you have not marked as bikeshedding). I’ve also skipped a lot of proposals to re-do large portions of riscv-bitmanip as it’s about a year too late for that.
For many sections it is not quite clear what the W versions (for RV64) or D versions (for RV128) are supposed to do. I will try to indicate how I understand it works.
The general semantic and reasoning behind *W instructions is explained in the base ISA spec. They always just perform the RV32 operation on RV64. I don’t see it in the scope of the Bitmanip spec to explain *W instructions as a concept.
It would be useful if a rationale for the use of an instruction would be given and quantitative data on use were available.
We are working on that. Right now we are building the tools. Can’t benchmark anything if you can’t generate the code.
Many sections contain C code as a way to define what the instruction is supposed to be doing. I like this because it is usually much easier to follow than words. However, I think there is value in making the definition as conceptually simple as possible even if it is (much) slower, and I will suggest some. Can we move faster but less intuitive implementations to an appendix?
These ARE the simple versions. The fast versions can be found in the “Fast C reference implementations” section within the “Reference Implementations” chapter.
* If the order of a loop is not important I use this macro
/* for loop that does not care about order: fake it by running backwards */
#define FOREACH( I , BEGIN, END) for( I = (END) - 1 ; I >= (BEGIN) ; --I)
You can’t be serious. I think we just have very different ideas of what “simple” C code is. The reference implementations won’t be changed to use stuff like that.
Comment: It seems to me that the B extension has moved to “integer instructions that extend or optimise the basic I instructions”. I think this is a good thing, and I would argue that the B extension should position as slightly cheaper than the M extension, [...]
That’s what the Zbb extension is.
For the W versions set LEN = 32.
I’m not adding that to the description of every instruction. I’ve now added one paragraph to the introduction that reiterates what *W instructions do.
Please add
Clz and ctz are used for finding the first elements of a bit map. In particular it is used for the implementation of malloc. Clz is also used for soft float.
Tbh I don’t think someone reading this document would need to be educated about the usefulness of clz/ctz, but I’ve added a few more applications to the clz/ctz section now.
Comment: I like that all basic logical operations have a “n” version.
Is nand and nor not also worth it (the RVV mask instructions seem to have them for example)?
Because of instructions like SUB there is an existing input inverter that can be used for ANDN, ORN, and XNOR, controlled by the same instruction bit that is used in those new instructions. The cost of adding ANDN, ORN, and XNOR is virtually zero. The cost of adding NAND/NOR would not be.
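For reference, the three instructions just invert their second operand, which is why the existing inverter can be reused (bit-by-bit sketch in the review's conventions, illustration only):
uint_xlen_t andn(uint_xlen_t rs1, uint_xlen_t rs2) { return rs1 & ~rs2; }
uint_xlen_t orn (uint_xlen_t rs1, uint_xlen_t rs2) { return rs1 | ~rs2; }
uint_xlen_t xnor(uint_xlen_t rs1, uint_xlen_t rs2) { return rs1 ^ ~rs2; }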
Please add
The pack instruction can be used to pack “struct” data that fits in an XLEN sized register and is passed in registers.
I’ve added language to that effect.
Proposal:
I really like the pack instructions. However I would propose to have 4 and XLEN independent pack instructions:
pack.bu rd rs1 rs2 # rd = zext(rs1, 8) | ( zext(rs2, 8) << 8 )
pack.hu rd rs1 rs2 # rd = zext(rs1, 16) | (zext(rs2, 16) << 16)
pack.wu rd rs1 rs2 # rd = zext(rs1, 32) | (zext(rs2, 32) << 32) // (RV64)
pack.du rd rs1 rs2 # rd = zext(rs1, 64) | (zext(rs2, 64) << 64) // (RV64, 128)
The only really new instruction in that list is pack.bu, and it replaces only two instructions (pack+shuffle).
Your pack.hu is just packw (RV64) or pack (RV32), except that packw fills the upper 32-bit with sign extension.
Your pack.wu is just pack on RV64 (or packd on RV128).
Your pack.du presumably is just packw on RV128.
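For comparison, a sketch of the pack/packw semantics referred to in that reply, written with the review's zext()/sext() helpers and assuming RV64 (XLEN = 64); a paraphrase, not spec text:
uint_xlen_t pack(uint_xlen_t rs1, uint_xlen_t rs2)
{
    /* lower half of rs1 in the low bits, lower half of rs2 above it */
    return zext(rs1, XLEN/2) | (rs2 << (XLEN/2));
}

uint_xlen_t packw(uint_xlen_t rs1, uint_xlen_t rs2)   /* RV64 only */
{
    /* pack the lower 16 bits of each source, then sign-extend the 32-bit result */
    uint_xlen_t lo = zext(rs1, 16) | (zext(rs2, 16) << 16);
    return (uint_xlen_t)sext((int_xlen_t)lo, 32);
}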
Page 7.
Comment.
One should not need to use shfl instructions to pack bytes. See proposal for pack.bu.
One should not need a pack.bu instruction to pack bytes. See shfl instructions. ;)
The ZEXT pseudo instruction could also use ADDWU. Also this is (as far as I can see) the only mention of RV128. It should perhaps be mentioned everywhere (preferably) or nowhere.
There’s nothing to gain by using ADDWU instead of PACK[D] for ZEXT.W. Using PACK for all ZEXT where it’s possible is more consistent.
Any mention of RV128 in the bitmanip spec is just commentary considering that there is not even a ratified RV128 base ISA spec.
When I mention RV128 it is only to show that we have considered implications for RV128. (For example wrt encoding of shift-like instructions.)
Page 7 section 2.1.5 Min/Max
Proposal:
I would propose to also add SGE and SGEU. As is done in RI5CY/PULP, this would be nicely placed with the SLT and SLTU instructions (i.e. have the same func3 bits as SLT and SLTU, see proposal in section 2.1.3), making each a 2-bit family.
That would replace a sequence of only two easily fusable instructions (SLT/SLTU + XORI).
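The two-instruction sequence mentioned in that reply would look like this (sketch, illustration only):
uint_xlen_t sge(int_xlen_t rs1, int_xlen_t rs2)
{
    uint_xlen_t t = (rs1 < rs2);   /* slt  t, rs1, rs2 */
    return t ^ 1;                  /* xori rd, t, 1    */
}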
A while back I disassembled the entire debian base image to figure out what percentage of instructions would be sequences matching ==, !=, ||, or &&, to estimate if it would be worth adding instructions for those. This was the result:
total 10489586
seteq 634 (0.0060%)
setne 1856 (0.0177%)
lor 353 (0.0034%)
lnor 243 (0.0023%)
I have not done a similar experiment for SGE or SGEU.
You are welcome to perform a similar experiment for SGE/SGEU and propose the instructions via github issue if it turns out they’d see a lot of usage. Here are my scripts for reference:
Page 8
Taking the absolute value of an integer is a generally useful operation and I think it is not needed to mention SAT solvers.
I’ve made the language a bit more generic. But note that even though it’s a generally useful operation, there are few applications that need it so frequently that it has measurable performance impacts. (That is of course in big part because calculating abs() isn’t very expensive even without min/max.) Therefore I think it does make sense to mention SAT solvers as a concrete application.
Page 8. Section 2.1.6 sbset/sbclr/sbinv/sbext
Please add:
These instructions efficiently support bitfields, and avoid costly constant constructions and nasty surprises especially if the bitfields are not in low bits.
In the context of bit manipulation instructions, the term “bitfield” is commonly used in the context of instructions that manipulate contiguous regions of bits. (See for example the ARM instructions BFC, BFI, SBFX, etc.)
These are single bit instructions, not bitfield instructions.
I think the s prefix should be reserved for instructions returning booleans (0/1), which means only SBEXT is properly named. I would suggest BSET1, BCLR1, BINV1, and SBEXT1 or SBGET1 (but see below)
You don’t even say why you feel that the “S” prefix should be reserved for such instructions.
Also, too late: there’s already SB, SH, SW, SLL[I], SRL[I], SRA[I] and the SEXT.W pseudo-instruction in the base ISA.
What is meant with “bitwise prefix sum of the bits in the lower byte of a 32 bit word on RV32” here?
https://en.wikipedia.org/wiki/Prefix_sum
https://en.wikipedia.org/wiki/Bitwise_operation
The example code in question is described as following in the spec:
“For example, the following code calculates the bitwise prefix sum of the bits
in the lower byte of a 32 bit word on RV32. The final prefix sum is stored in the 8 nibbles of the output word.”
I.e. when A[i] is bit i of the input, and B[i] is nibble i of the output, then
B[0] = A[0]
B[1] = A[0] + A[1]
B[2] = A[0] + A[1] + A[2]
etc.
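In other words, something like the following C sketch (an illustration of the description above, using <stdint.h> types, not the spec's example code):
uint32_t bitwise_prefix_sum8(uint32_t a)
{
    uint32_t out = 0, sum = 0;
    for (int i = 0; i < 8; i++) {
        sum += (a >> i) & 1;        /* sum = A[0] + ... + A[i]            */
        out |= sum << (4 * i);      /* store it in nibble i of the output */
    }
    return out;
}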
Since you already say that they often are not faster than software emulation, and one of the first hits you get for PEXT is that AMD implements them horribly slow, maybe they should not be in B but still be defined as a Z extension.
Did you even read the introduction section of the spec chapter?
Regards,
- Clifford
Regarding the suggestion of extending the adduw/slluw etc to a whopping 82+ operations covering all permutations of 8/16/32 on all src and dest operands:
1. It completely breaks the RISC paradigm
2. *this is exactly what SimpleV's element width override is designed to provide*.
3. What you are suggesting, Rogier, despite there being good reasons, is exactly the sort of nightmare opcode proliferation behind SIMD. See "SIMD Considered harmful". SIMD is, at its heart, an O(n^6) opcode proliferation nightmare. Yes, really, 6 dimensions of options. You just hit 4 of them.
SimpleV provides a means to individually override the bitwidth of all and any of the operands, such that an instruction's behaviour covers 4x4x4-1 permutations that it otherwise could not possibly have.
It takes some setup, and was originally designed for vectorisation scenarios, where manipulating multiple packed byte (or 16 bit) elements would otherwise be prohibitively slow.
So in a single op case there is some overhead (32 bits). If however there are contiguous registers which require the same operation then that 32 bit overhead is a one-off cost for all the contiguous operations, and, furthermore, only a single instruction is needed to cover them all.
The actual SV element width override behaviour, when a vector is set (and VL=1), is that the top bits of the destination register must be left UNMODIFIED. This can save on about 2 to 3 instructions in packed structs which would otherwise require ORs and ANDs etc.
If however scalar mode is set and elwidth override set to 32 bit on the destination, ADD will do exactly what addu does, and ADD.W will do exactly what adduw does.
SV provides this functionality *WITH ZERO OPCODES ADDED*.
to reiterate: SV provides the functionality of adduw, subuw and slluw, and the full set of 82+ variants recommended by Rogier, *even though there is not one single new instruction added by SV*.
Please consider carefully whether the opcode space should be wasted here, reading, absorbing and applying the lesson of "SIMD considered harmful".
Including whether adduw etc should be added at all (given particularly that SV provides them "for free")
I reiterate what I said earlier: BitManip has some very valuable and necessary opcodes, however compared to the original 0.35 and 0.36 xBitManip of 18 months ago, where Clifford's initial analysis was thorough, clear and extremely well justified, the "specialised" opcode proliferation present in the 0.90 spec which has little to no general purpose use-cases is extremely alarming and should be ringing massive alarm bells across the entire RISCV community.
The principles of RISC are that you pick the best suite of opcodes that are both easy to do in hardware yet also, for the majority of cases, give damn good power efficiency and compact code right across the computing industry.
That said: if it takes 82 additional opcodes to add something that is actually really useful and can be shown to significantly reduce code size in a large number of common cases, then there is something really badly wrong with the RISCV ISA (and with the whole RISC principle).
Rogier, can it be shown how much code size would be reduced by those 82 opcodes, and how common such opportunities are?
Is anyone able to assist with such an evaluation?
L.
For many sections it is not quite clear what the W versions (for RV64) or D versions (for RV128) are supposed to do. I will try to indicate how I understand it works.
The general semantic and reasoning behind *W instructions is explained in the base ISA spec. They always just perform the RV32 operation on RV64. I don’t see it in the scope of the Bitmanip spec to explain *W instructions as a concept.
It is not the concept of W, which is perfectly clear, it is that you mention XLEN specifically in the C code (which should change to 32 for W) and uint_xlen_t (which should just stay what it is) and standards need to be excruciatingly explicit.
Right. It zero extends because that is needed for the use case of building
struct foo { short low, high; }
Taking the absolute value of an integer is a generally useful operation and I think it is not needed to mention SAT solvers.
I’ve made the language a bit more generic. But note that even though it’s a generally useful operation, there are few applications that need it so frequently that it has measurable performance impacts. (That is of course in big part because calculating abs() isn’t very expensive even without min/max.) Therefore I think it does make sense to mention SAT solvers as a concrete application.
Maybe it is just me, but SAT solvers look like a bit of a niche application to me.
Page 8. Section 2.1.6 sbset/sbclr/sbinv/sbext
Please add:
These instructions efficiently support bitfields, and avoid costly constant constructions and nasty surprises especially if the bitfields are not in low bits.
In the context of bit manipulation instructions, the term “bitfield” is commonly used in the context of instructions that manipulate contiguous regions of bits. (See for example the ARM instructions BFC, BFI, SBFX, etc.)
These are single bit instructions, not bitfield instructions.
Good point. I meant bitfield in the sense of the C language, i.e. an example like
struct nasty {
    unsigned long long counter : 62;
    unsigned flag1 : 1; // ouch, have to manipulate bit 62 and 63, good luck if your code depends on nasty being read in one go
    unsigned flag2 : 1;
};
Since you already say that they often are not faster than software emulation, and one of the first hits you get for PEXT is that AMD implements them horribly slow, maybe they should not be in B but still be defined as a Z extension.
Did you even read the introduction section of the spec chapter?
Yes, and it has/had the BEXT /BDEP in the grey-shaded "suggested to be in B" block. But I guess one could implement Zbb instead.
If I counted correctly that would mean adding 82 new R-type instructions. That’s over half of a minor opcode. You can’t be serious.
I am serious. That there are so many ADDs comes from making it two families, and from the internal logic of the encoding that says that if you have an instruction and the instruction doesn't preserve sign extension, you must also have the corresponding W and D versions, and it is cheap to do so because the proper sign extension hardware is there anyway. But note that every single one of them corresponds directly to (one of the admittedly many) C language + operations, and adds are bread and butter, so it is OK if it takes up room. I think avoiding nasty surprises and being able to support C operations essentially 1 to 1 in the ISA is an important goal. One could reduce the size by half by ditching the sl and perhaps the b/bu versions.
Page 42 Section 2.12.4
I think the extraction sequence should be
slli rd rs1 (-(pos + len)) % XLEN
srli rd rd (-len) % XLEN
I think taking the remainder of a negative number is a bad idea in many contexts, but especially in something like a specification, considering that in languages like C the behavior is implementation-defined.
Should have been more clear that we are dealing with _unsigned_ integers (like uint_xlen_t) and the % operator is perfectly well defined for "negative" integers. Systematically writing % XLEN just stresses the fact that shifts in RV have a shift amount that is naturally thought of as (the equivalence class of) a number mod LEN, with modular arithmetic mod LEN (I always thought this was clever and clean in RV).
More importantly, an expression like rs1 << (XLEN - rs2) is undefined in C for rs2 == 0, while in RV it does rs1 << 0, i.e. uses rs1 << ((-rs2) % XLEN).
> this always makes sense, and does “the right thing” for 0 <= pos + len <= XLEN, pos >= 0, and len >= 0 except pos == len == 0 :-(.
I’m not even sure what you mean by “the right thing” for len=0 and pos!=0.
It should return 0 as you extract an empty string.
One should really not have to use SHFL to pack bytes in a word. With PACK.BU and PACK.HU it becomes the straightforward
You are saying that as if it would be obvious what’s bad about using shuffle for that. It’s not obvious to me. Why do you think we need to invent new instructions to do something that can already be done with SHFL?
Because I feel uncomfortable with SHFL :-) But seriously, SHFL is not in Zbb and it seems specialised to me. On the other hand, packing two bytes in a halfword or two halfwords in a word seems like the kind of thing that is needed for something bulk and common like packing network packets or RGBA data, and perfectly reasonable to expect from Zbb.
Page 52
The butterfly function is undefined and seems to depend on parameters other than just the stage number.
The first paragraph in that section defines what a stage-N butterfly operation is.
Do you mean the two-instruction unnamed (apparently #define type) macro at the beginning of section 4.3.1 in the v0.91 version?
Then I think it would be clearer to name it there and make the a0, a1, a2 explicit in the parameters, i.e.
#define butterfly(N, a0, a1, a2, a3)
    grevi a2, a0, (1 << N)
    cmix a0, a1, a2, a0
with the corresponding change in the sections below.
Hi,
On Wed, Sep 4, 2019 at 9:37 PM Rogier Brussee <rogier...@gmail.com> wrote:
For many sections it is not quite clear what the W versions (for RV64) or D versions (for RV128) are supposed to do. I will try to indicate how I understand it works.
The general semantic and reasoning behind *W instructions is explained in the base ISA spec. They always just perform the RV32 operation on RV64. I don’t see it in the scope of the Bitmanip spec to explain *W instructions as a concept.
It is not the concept of W, which is perfectly clear, it is that you mention XLEN specifically in the C code (which should change to 32 for W) and uint_xlen_t (which should just stay what it is) and standards need to be excruciatingly explicit.
The bitmanip document currently states “Like in the base ISA, we add *W instruction variants on RV64 with the semantic of the matching RV32 instruction.” That’s sufficient imo.
Right. It zero extends because that is needed for the use case of building struct foo { short low, high; }
I don’t see anything in the calling convention that would require the upper bits of the register holding an aggregate of that type to have any specified value. So simply using PACKW to pack this struct on RV64 would be perfectly fine afaict.
Taking the absolute value of an integer is a generally useful operation and I think it is not needed to mention SAT solvers.
I’ve made the language a bit more generic. But note that even though it’s a generally useful operation, there are few applications that need it so frequently that it has measurable performance impacts. (That is of course in big part because calculating abs() isn’t very expensive even without min/max.) Therefore I think it does make sense to mention SAT solvers as a concrete application.
Maybe it is just me, but SAT solvers look like a bit of a niche application to me.
Well, I did make the language more generic, mentioning SAT solver as one application.
But worry not, I can assure you that min/max does not only help for SAT solvers.
There have been some benchmarks showing significant speedup in specint just from MIN/MAX. But I can't reference that because it hasn't been properly published (yet?).
Page 8. Section 2.1.6 sbset/sbclr/sbinv/sbext
Please add:
These instructions efficiently support bitfields, and avoid costly constant constructions and nasty surprises especially if the bitfields are not in low bits.
In the context of bit manipulation instructions, the term “bitfield” is commonly used in the context of instructions that manipulate contiguous regions of bits. (See for example the ARM instructions BFC, BFI, SBFX, etc.)
These are single bit instructions, not bitfield instructions.
Good point. I meant bitfield in the sense of the C language, i.e. an example like
struct nasty {
    unsigned long long counter : 62;
    unsigned flag1 : 1; // ouch, have to manipulate bit 62 and 63, good luck if your code depends on nasty being read in one go
    unsigned flag2 : 1;
};
Well, that’s the special case of a bit field of size “1”. The “counter” member is a bitfield of size 62. (Jfyi, you want to use the “packed” attribute when doing something like that.)
Since you already say that they often are not faster than software emulation, and one of the first hits you get for PEXT is that AMD implements them horribly slow, maybe they should not be in B but still be defined as a Z extension.
Did you even read the introduction section of the spec chapter?
Yes, and it has/had the BEXT/BDEP in the grey-shaded "suggested to be in B" block. But I guess one could implement Zbb instead.
I actually meant literally _read_. The text.
If I counted correctly that would mean adding 82 new R-type instructions. That’s over half of a minor opcode. You can’t be serious.
I am serious. That there are so many ADDs comes from making it two families, and from the internal logic of the encoding that says that if you have an instruction and the instruction doesn't preserve sign extension, you must also have the corresponding W and D versions, and it is cheap to do so because the proper sign extension hardware is there anyway. But note that every single one of them corresponds directly to (one of the admittedly many) C language + operations, and adds are bread and butter, so it is OK if it takes up room. I think avoiding nasty surprises and being able to support C operations essentially 1 to 1 in the ISA is an important goal. One could reduce the size by half by ditching the sl and perhaps the b/bu versions.
I think macro-op fusion is the answer here, not defining 82 almost identical instructions. The nice thing about macro-op fusion is that it is completely optional. A small core might not bother, and a large one might do it, but both support the same ISA. By defining a flood of small instructions we’d be forcing unnecessary complexity on small cores.
Page 42 Section 2.12.4
I think the extraction sequence should be
slli rd rs1 (-(pos + len)) % XLEN
srli rd rd (-len) % XLEN
I think taking the remainder of a negative number is a bad idea in many contexts, but especially in something like a specification, considering that in languages like C the behavior is implementation-defined.
Should have been more clear that we are dealing with _unsigned_ integers (like uint_xlen_t) and the % operator is perfectly well defined for "negative" integers. Systematically writing % XLEN just stresses the fact that shifts in RV have a shift amount that is naturally thought of as (the equivalence class of) a number mod LEN, with modular arithmetic mod LEN (I always thought this was clever and clean in RV). More importantly, an expression like rs1 << (XLEN - rs2) is undefined in C for rs2 == 0, while in RV it does rs1 << 0, i.e. uses rs1 << ((-rs2) % XLEN).
We are dealing with pseudo code here. There’s no formal definition in that part of the spec what C type “pos” and “len” are. But from the context I think the assumption that these are (gnu) assembler expressions is justified. And afaict there’s no formal definition on this level of detail how those work (https://sourceware.org/binutils/docs/as/Integer-Exprs.html) wrt things like signedness. So why not just avoid the whole issue and not use modulo in this way.
> this always makes sense, and does “the right thing” for 0 <= pos + len <= XLEN, pos >= 0, and len >= 0 except pos == len == 0 :-(.
I’m not even sure what you mean by “the right thing” for len=0 and pos!=0.
It should return 0 as you extract an empty string.
Then your code does not do “the right thing” because your code will return the input unmodified (left shift by zero bits followed by right shift by zero bits).
One should really not have to use SHFL to pack bytes in a word. With PACK.BU and PACK.HU it becomes the straightforward
You are saying that as if it would be obvious what’s bad about using shuffle for that. It’s not obvious to me. Why do you think we need to invent new instructions to do something that can already be done with SHFL?
Because I feel uncomfortable with SHFL :-) But seriously, SHFL is not in Zbb and it seems specialised to me. On the other hand, packing two bytes in a halfword or two halfwords in a word seems like the kind of thing that is needed for something bulk and common like packing network packets or RGBA data, and perfectly reasonable to expect from Zbb.
Tbh I find it hard to identify a common theme behind your arguments. On the one hand you say ROL/ROR is not really needed because it can be emulated with three instructions for the case of immediate shift amount, and four instructions for the case of variable shift amount, and on the other hand you want a special instruction for RD = (RS1&255) | ((RS2&255)<<8) in Zbb. (And in many practical cases the "&255" is not actually needed.)
Page 52
The butterfly function is undefined and seems to depend on parameters other than just the stage number.
The first paragraph in that section defines what a stage-N butterfly operation is.
Do you mean the two-instruction unnamed (apparently #define type) macro at the beginning of section 4.3.1 in the v0.91 version?
Then I think it would be clearer to name it there and make the a0, a1, a2 explicit in the parameters, i.e.
#define butterfly(N, a0, a1, a2, a3)
    grevi a2, a0, (1 << N)
    cmix a0, a1, a2, a0
with the corresponding change in the sections below.
The text above the two instructions literally reads “The following macro performs a stage-N butterfly operation on the word in a0 using the mask in a1.” So I don’t know where you get the idea that it is unnamed, or that it’s unclear that a0 and a1 are arguments.
regards,
- Clifford
On Wednesday, September 4, 2019 at 23:04:40 UTC+2, lkcl wrote:
Hi, ok it is fantastic to see this review taking place
The bad thing is nobody with more expertise did.
, it is so long as an inline conversation that I am going to break the inline reply convention and summarise the context.
Regarding the suggestion of extending the adduw/slluw etc to a whopping 82+ operations covering all permutations of 8/16/32 on all src and dest operands:
It is 22 on RV32, 11 if you drop support for SUB (which is not expensive) and 4 if you also drop the shifted versions and prefer fusion.
pack a0, a0, a1
pack a1, a2, a3
shfl a0, a0, 8
shfl a1, a1, 8
pack a0, a0, a1
VBLK.regs[0] = {fmt:INT, elwidth:8, isvec:true, regidx:a0}
VBLK.regs[0] = {fmt:INT, elwidth:default, isvec:true, regidx:a4}
VBLK.setvl = 4
C.MV a0, a4
typedef union {
uint8_t actual_reg_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
uint8_t b[0]; // array of type uint8_t
uint16_t s[0]; // array of type uint16_t
uint32_t i[0]; // array of type uint32_t
uint64_t l[0]; // etc.
uint128_t d[0]; // etc.
} reg_t;
reg_t int_regfile[128];
1. It completely breaks the RISC paradigm
See my reply to Clifford. It is meant to be a pragmatic solution to the fact that while B and H are second class citizens, some people use (unsigned) shorts and signed bytes anyway, extending how ADDU.W and ADDWU are pragmatic solutions to the fact that people use unsigned ints.
Hi,
Reflecting on Rogier’s review I am now proposing two new instructions, REPACK[I] and PACKB; please feel free to comment on github.
On Friday, September 6, 2019 at 1:08:53 PM UTC+1, Rogier Brussee wrote:
On Wednesday, September 4, 2019 at 23:04:40 UTC+2, lkcl wrote:
Hi, ok it is fantastic to see this review taking place
The bad thing is nobody with more expertise did.
which has me quite concerned. bitmanip is a complex strategic area. clifford's original work (0.35, 0.36), including a proper static analysis audit, was superb. i think it's safe to say that he's getting a little overwhelmed and overloaded.
i'd like to see other people help out, here.
, it is so long as an inline conversation that I am going to break the inline reply convention and summarise the context.
Regarding the suggestion of extending the adduw/slluw etc to a whopping 82+ operations covering all permutations of 8/16/32 on all src and dest operands:
It is 22 on RV32, 11 if you drop support for SUB (which is not expensive) and 4 if you also drop the shifted versions and prefer fusion.
still has me concerned. SV can achieve the packing with SIXTY PERCENT LESS instructions used compared to bitmanip (assuming 32-bit ops). with so little space in RVC, the argument for putting SHFL and PACK into RVC would need to be *really* watertight. i.e.: unlikely.
and remember that's with 4 8-bit values to pack. if there were more values to pack, it's the same SV instruction count.
section 4.1.4 (vdraft, probably same as in 0.91):
1. It completely breaks the RISC paradigm
See my reply to Clifford. It is meant to be a pragmatic solution to the fact that while B and H are second class citizens, some people use (unsigned) shorts and signed bytes anyway, extending how ADDU.W and ADDWU are pragmatic solutions to the fact that people use unsigned ints.
mmm.... i feel that this is a slippery-slope argument.
to explain that, let's take an analogy: "people want video processing, therefore SIMD is a pragmatic solution to the fact that people do a lot of 8/16-bit bit manipulation".
BitManip is supposed to be about *bit* manipulation, which is why i was so excited by the original 0.35 and 0.36 drafts. shuffle, grev, clz, pcnt and so on - these are really difficult to do efficiently with byte/word/dword level instructions and so are easily justifiable and worthwhile.
since then: single-bit-set, these are incredibly useful on embedded platforms, particularly when it comes to GPIO set/clear operations. without these single-bit instructions, you actually have to have special memory-mapped IO registers, where the same underlying I/O latch is mapped *three times* in different locations: one for the "real" value, one for "ORing", one for "CLEARing" (&= ~xxxx).
likewise: although its encoding is slightly awkward, i can see a strong case for bfp. again, embedded systems, peripheral management in the linux kernel - being able to drop in a range of bits into a register? that's really really powerful.
*byte* level manipulation on the other hand? no. if there are *byte* level instructions being added, here, to me that's indicative that the *bit* level manipulation has either not been thought through or not designed correctly in the first place.
example: pack? mmm... no. i bet... if bfp was extended to 64 bit, you could actually use bfp instead. actually... hang on... the layout of bfp is as follows:
The layout of the control word in rs2 is as follows. LEN=0 encodes for LEN=16.
| 3 2 1 |
|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
|-------------------------------|-------------------------------|
| | LEN | OFF | DATA |
|---------------------------------------------------------------|
my question is: OFFS+LEN is 12 bits. that will fit into an immediate. therefore, why is there no bfpi?
(a) there's a bug in bfp: it cannot access 16 bits, it can only access *15* bits.
[SNIP]
2) bitmanip needs to focus on *bit* manipulation, providing *byte* and *word* level manipulation as a side-effect
3) pack needs to go.
4) BFP can be extended and become much more powerful, covering its functionality and offering way more besides, replacing pack as well as the entire single-bit range of opcodes.
[snip]
>
> PACKB a0 a0 a1
> PACKB t0 a2 a3
> PACKH a0 a0 t0
3x32bit, 6x 16bit words.. not bad.
>
> where PACKH is PACK on RV32 and PACKW on RV64 or RV128
Which gives a hierarchical tree, i.e. an O(log n) number of instructions depending on pack width.
> 1. It completely breaks the RISC paradigm
>
>
> see my reply to Clifford. It is meant to be a pragmatic solution to the fact that while B and H are second class citizens, some people use (unsigned) shorts and signed bytes anyway, extending how ADDU.W and ADDWU are pragmatic solutions to the fact that people use unsigned ints.
Yep, got it. And that they could be needed a lot due to the RISC-V function call API conventions.
Like it.
>
> mmm.... i feel that this is a slippery-slope argument.
>
>
> I don't agree but OK.
It's ok, I get the use case now.
> My proposal has absolutely nothing to do with SIMD but are two essentially separate proposals for
>
>
> 1. Supporting signed and unsigned chars and signed and unsigned shorts, signed and unsigned ints and signed and unsigned long long's and input_t , and their interactions with signed and unsigned ints and signed and unsigned long longs and intptr_t by supporting add's between them and in the process provide sign extension and zero extension for bytes, shorts, ints, and long long's (the zero extension can also be done with PACK). Sign and zero extension will take care of every other operation because this is where the slope starts. The only possible exception is SUB because it is so cheap to support. Any other sign and zero extension can be done just fine with the SLLI SR[AL]I trick.
This looks like something that is a description of important use-cases that would need to go into the bitmanip appendix.
Can you think of some examples that illustrate? (i am saying i am having difficulty processing the verbal description)
char foo[];
void copy_foo(char* buf, unsigned short pos, unsigned short len)
{
//fools rush in where angels fear to tread
for (unsigned short i = pos; i < pos + len; ++i)
buf[i] = foo[i];
}
copy_foo:
la t0 foo
addw t1 a1 a2 #two unsigned shorts add to an unsigned _int_ assume they are properly zero extended already
bgeu a0 t1 endloop
loop:
add.hu t2 t0 a1 #do the address calculation of foo[i] treating i/a1 as unsigned short!
lbu t2 0(t2)
add.hu t3 a0 a1 #do the address calculation of buf[i] treating i/a1 as unsigned short!
sb t2 0(t3)
addwi a1 a1 1 #we don't care about overflow here
bltu a0 t1 loop
endloop:
ret
> 2. Support the most common fused shift/add as a regular instruction because it is easier to implement than fusion, less restrictive in the registers that can be used and/or shorter, and a single instruction is slightly easier to fuse with a load/store instruction.
Again apologies can you elaborate with some sequences?
struct foo **foo; #a0
size_t n; #a1
struct foo* foo_n; #a5
foo_n = foo[n];
addsl3 a5 a0 a1 //assume RV64 so a pointer is 8 byte
ld a5 0(a5)
[snip]
> 3) pack needs to go.
>
>
> I totally disagree.
>
Yeah me too, now :)
I can see it being really useful if combined with SV, for packing large arrays of bytes/words.
One of the awesome things about SV is that the regfile is typecast. So if there is a regular array of mixed byte/word structs that needs processing or packing, we can throw parallel PACKB and PACKH etc operations at the data.
[skip]
Discussion for another time. Happy with PACKB, PACKH and BFI.
L.
gorc seems to simply be the following:
grev t0, t1, rs2
or t0, t0, t1
Is that correct? If so, what significant general purpose savings are there which justify the additional opcode?
If an extremely commonly used algorithm exists for which gorc would reduce code size by significant margins compared to those two opcodes (grev, or) it would be good to list it.
Otherwise gorc is better off as a macro op fusion.
Or, is there something mathematically different about the ORing at each layer where the partial results cascade and you get different answers?
L.
For a limited set of CRCs, I would have picked the CCITT CRCs myself: CRC32 & CRC16.
Can’t larger CPUs do CRC32C, etc. with the Galois multiplies in the security subset?
So, only very small embedded CPUs need CRC instructions. They often need to communicate, thus the CCITT CRCs.
The applications in the Wikipedia CRC article seems to support this.
Essentially the same problems that you have with unsigned int's on RV64, you have with unsigned shorts and unsigned bytes on RV32. Of course it is not good practice to use unsigned shorts, it just happens anyway. For signed shorts and chars you can invoke undefined behaviour to not care about overflow (and do you want to?). However you need sign/zero extension every time you pass a signed/unsigned short /char by the calling convention.
>
> 1. Supporting signed and unsigned chars and signed and unsigned shorts, signed and unsigned ints and signed and unsigned long long's and input_t , and their interactions with signed and unsigned ints and signed and unsigned long longs and intptr_t by supporting add's between them and in the process provide sign extension and zero extension for bytes, shorts, ints, and long long's (the zero extension can also be done with PACK). Sign and zero extension will take care of every other operation because this is where the slope starts. The only possible exception is SUB because it is so cheap to support. Any other sign and zero extension can be done just fine with the SLLI SR[AL]I trick.
This looks like something that is a description of important use-cases that would need to go into the bitmanip appendix.
Can you think of some examples that illustrate? (i am saying i am having difficulty processing the verbal description)
char foo[];
void copy_foo(char* buf, unsigned short pos, unsigned short len)
{
//fools rush in where angels fear to tread
for (unsigned short i = pos; i < pos + len; ++i)
buf[i] = foo[i];
}
copy_foo:
la t0 foo
addw t1 a1 a2 #two unsigned shorts add to an unsigned _int_ assume they are properly zero extended already
bgeu a0 t1 endloop
loop:
add.hu t2 t0 a1 #do the address calculation of foo[i] treating i/a1 as unsigned short!
lbu t2 0(t2)
add.hu t3 a0 a1 #do the address calculation of buf[i] treating i/a1 as unsigned short!
sb t2 0(t3)
addwi a1 a1 1 #we don't care about overflow here
bltu a0 t1 loop
endloop:
ret
Essentially the same problems that you have with unsigned int's on RV64, you have with unsigned shorts and unsigned bytes on RV32. Of course it is not good practice to use unsigned shorts, it just happens anyway. For signed shorts and chars you can invoke undefined behaviour to not care about overflow (and do you want to?). However you need sign/zero extension every time you pass a signed/unsigned short /char by the calling convention.
[snip]
copy_foo:
la t0 foo
addw t1 a1 a2 #two unsigned shorts add to an unsigned _int_ assume they are properly zero extended already
bgeu a1 t1 endloop
loop:
add.hu t2 t0 a1 #do the address calculation of foo[i] treating i/a1 as unsigned short!
lbu t2 0(t2)
add.hu t3 a0 a1 #do the address calculation of buf[i] treating i/a1 as unsigned short!
sb t2 0(t3)
addwi a1 a1 1 #we don't care about overflow here
bltu a1 t1 loop
endloop:
ret
addwi a1 a1 1 #AARGH we very much _DO_ care about overflow here
bltu a0 t1 loop #Because t1 may be MAXSHORT + 1
loop:
add t2 t0 a1 #do the address calculation of foo[i] treating i/a1 as size_t because we already zero extended!
lbu t2 0(t2)
add t3 a0 a1 #do the address calculation of buf[i] treating i/a1 as size_t because we already zero extended!
sb t2 0(t3)
addwi a1 a1 1
zexth a1 a1 #we zero extend
bltu a1 t1 loop #because we DO care about overflow here
endloop:
unsigned short pos;
char foo[];
unsigned short idx = pos + 1000;
char c = foo[idx];
la t0 foo
addwi t1 a1 1000
add.hu t0 t0 t1
lbu t0 0(t0)
On Mon, Sep 9, 2019, 11:13 Rogier Brussee <rogier...@gmail.com> wrote:
Essentially the same problems that you have with unsigned int's on RV64, you have with unsigned shorts and unsigned bytes on RV32. Of course it is not good practice to use unsigned shorts, it just happens anyway. For signed shorts and chars you can invoke undefined behaviour to not care about overflow (and do you want to?). However you need sign/zero extension every time you pass a signed/unsigned short/char by the calling convention.
Do note that there are languages such as Rust (which is being increasingly used for embedded and systems programming as a replacement for C and C++) where signed and unsigned overflow is never undefined behavior -- it is defined to always wrap (release mode).
Sorry this went under the wrong thread.
gorc seems to simply be the following:
grev t0, t1, rs2
or t0, t0, t1
Is that correct?
If so, what significant general purpose savings are there which justify the additional opcode?
If an extremely commonly used algorithm exists for which gorc would reduce code size by significant margins compared to those two opcodes (grev, or) it would be good to list it.
[..] For signed shorts and chars you can invoke undefined behaviour to not care about overflow (and do you want to?). However you need sign/zero extension every time you pass a signed/unsigned short /char by the calling convention.
Hi,
On Mon, Sep 9, 2019 at 8:13 PM Rogier Brussee <rogier...@gmail.com> wrote:
[..] For signed shorts and chars you can invoke undefined behaviour to not care about overflow (and do you want to?). However you need sign/zero extension every time you pass a signed/unsigned short/char by the calling convention.
That's not entirely true. Because, as you have stated correctly, overflow for signed data types is undefined in C, it is safe for a compiler to not sign-extend in most cases involving signed data types. There's no need to adhere to the calling convention once undefined behavior is triggered.
So only fast zero extension is required, which is provided via "pack[w] x, x, zero" for "short" and "andi x, x, 255" for "char". (Other languages that have defined behavior for signed overflow might require two shifts to sign-extend, which may be one instruction on architectures that macro-op-fuse such sequences.)
regards,
- Clifford
On Monday, September 16, 2019 at 09:46:51 UTC+2, clifford wrote:
Hi,
On Mon, Sep 9, 2019 at 8:13 PM Rogier Brussee <rogier...@gmail.com> wrote:
[..] For signed shorts and chars you can invoke undefined behaviour to not care about overflow (and do you want to?). However you need sign/zero extension every time you pass a signed/unsigned short/char by the calling convention.
That's not entirely true. Because, as you have stated correctly, overflow for signed data types is undefined in C, it is safe for a compiler to not sign-extend in most cases involving signed data types. There's no need to adhere to the calling convention once undefined behavior is triggered.
I was just going to uncomfortably agree with you when I realised that I was wrong about the overflow.
Disclaimer: I am not a language lawyer (and have no ambitions in that direction). However what I wrote below seems to be backed up by
> n = ctz(~gorc(x,7)) / 8
>
> finds the index of the first zero byte in x or is XLEN/8 if there is no zero byte.
Yes, exactly. And gorc(x,3) finds zero nybbles, gorc(x,1) finds zero
(aligned) pairs of bits, gorc(x,15) finds zero shorts etc. Of course
preconditioning x with an xor or subtract allows you to find any
arbitrary value, not just zero.
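To make this idiom concrete, here is a small C sketch (mine, not from the thread) that uses a reference-style gorc on a 64-bit register to locate the first zero byte; treat the gorc model as an approximation of the draft definition:

#include <stdint.h>

/* Reference-style gorc64: OR-combine bits within the blocks selected by
   shamt (approximation of the draft definition). */
static uint64_t gorc64(uint64_t x, unsigned shamt)
{
    if (shamt & 1)  x |= ((x & 0x5555555555555555ULL) << 1)  | ((x & 0xAAAAAAAAAAAAAAAAULL) >> 1);
    if (shamt & 2)  x |= ((x & 0x3333333333333333ULL) << 2)  | ((x & 0xCCCCCCCCCCCCCCCCULL) >> 2);
    if (shamt & 4)  x |= ((x & 0x0F0F0F0F0F0F0F0FULL) << 4)  | ((x & 0xF0F0F0F0F0F0F0F0ULL) >> 4);
    if (shamt & 8)  x |= ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x & 0xFF00FF00FF00FF00ULL) >> 8);
    if (shamt & 16) x |= ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x & 0xFFFF0000FFFF0000ULL) >> 16);
    if (shamt & 32) x |= (x << 32) | (x >> 32);
    return x;
}

/* Portable ctz; returns 64 for x == 0 (a hardware ctz would return XLEN). */
static unsigned ctz64(uint64_t x)
{
    unsigned n = 0;
    if (x == 0) return 64;
    while (!(x & 1)) { x >>= 1; n++; }
    return n;
}

/* Index of the first zero byte in x, or 8 (XLEN/8) if there is none. */
static unsigned first_zero_byte(uint64_t x)
{
    return ctz64(~gorc64(x, 7)) / 8;
}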
Small machines that don't want the full cost of GREV+GORC (similar to
a barrel shifter) can implement only the most useful combinations
using simpler circuitry .. for example bit reversal, byte reversal
(endian conversion), zero byte detection .. but use the same opcodes
as full GREVI/GORCI.
> ah. right. the problem with that is: it risks further fragmentation. what i would suggest instead, is: similar to the way that variable bitshift may be done with an FSM that simply blocks for multiple cycles but otherwise produces the correct answer, an algorithm be researched *and included* in the specification, which provides the *full* identical capability of GREVI/GORCI, is very efficient in terms of minimising gate count, but simply takes far more cycles to complete.
Such minimal gate count implementations are going to take longer to
run than simply compiling the C reference implementation in the manual
to the base ISA. You'd be better off to trap and emulate.
I'm not averse to an extension to speed up various common operations
(e.g. (add, sub)wu, (add, sub, slli)u.w, pack), which can of course
overlap with B, but we ought to design it as such. I propose the name
U for this extension, as it can be considered to mean the "utility"
extension, and is phonetically the back-vowel counterpart of I.
I think U is already taken for user-mode.
On Mon, Sep 2, 2019 at 1:55 PM Rogier Brussee <rogier...@gmail.com> wrote:
> Review of Xbitmanip V0.9
>
> Rogier Brussee
>
> 2019-09-03
A general note: not everything in V0.9 (or V0.91, which was already
current when Rogier wrote this)
will make it into the final spec.
Things in V0.91 now are things that appear to be worth *evaluating*
for their potential contribution to code size, execution speed, or
both. That evaluation will probably be negative in at least some
cases. The evaluation is taking place now and will be for probably
several months.
As a personal opinion I expect that we will end up with:
- a "B" extension aimed at being virtually automatically included in
all future application processors, that is those aimed at running full
OSes such as Linux, and currently implementing RV64GC. As such, it
should include only things that provide a tangible benefit for general
code, don't have a large cost in gates, and in particular don't impact
the CPU cycle time -- there's no point in getting a 0.1% decrease in
cycles if your achievable clock speed drops by 1%. I think some things
the current draft shows as being in B will or should fall out of it,
perhaps including such things as pcnt, slo/sro, rol/ror, and
grev/gorc/[un]shfl. "B" might well be seen as too expensive by makers
of microcontrollers.
- a set of Zbxxx extensions that provide a large benefit to at least
some applications, which implementors can pick and choose. Each Zbxxx
extension contains related instructions that are synergistic in use
and perhaps also share circuitry. Each Zbxxx extension is
assigned official opcodes so that everyone uses the same ones and
standard toolchains can be shared.
> General remarks:
>
>> Great that the B extension is moving again and making progress. Special thanks to Clifford for pushing this for a long time, must have been an awful lot of work.
Seconded!
I'll try not to comment on things that Clifford already addressed.
> It would be useful if a rationale for the use of an instruction would be given and quantitative data on use were available.
Yes. We also need provenance of each instruction, as Dave Patterson
has done for the base instruction set.
> Comment: I would propose a further criterium:
>
> Support compilers implementing the standard C like language basic data types (pointers, _Bool/bool, char, [un]signed char, [unsigned] short, [unsigned] int, [unsigned] long long) and their basic operations without surprises regarding performance and/or code size, in particular for software written sub optimally or not written for RiscV e.g. using unsigned or short integers.
Agreed. The *wu and *u.w instructions fall into this category and
should eliminate a current RV64 weakness in dealing with unsigned int
loop counters and array indexes in LP64 code.
> Page 4. Section 2.1. Basic Manipulation instructions
>
> Comment: It seems to me that the B extension has moved to “integer instructions that extend or optimise the basic I instructions. I think this is a good thing, and I would argue that the B extension should position as slightly cheaper than the M extension, useful for basic compiler support, standard applications like bit banging in device drivers, networking, and very basic algorithms like hashing.(see also the remark on supporting the C/C++ language). Z extensions can then cater for more specialised purposes.
Agreed. I've been thinking for some time that the scope of B seems to
have expanded beyond "bit manipulation" -- and that this is a good
thing and it's good to include a few other things such as the *wu and
*u.w instructions in it. We should probably rename it, but I'm not
sure to what.
> Maybe it is useful to have a parity instruction even if this can be implemented with pcnt and &1, because it should use less power.
I think that may be worth defining an opcode, yes. Or at least
evaluating. Unary opcodes are cheap.
> Proposal:
>
> We have ctz(n) == pcnt(~n & (n -1)) == pcnt(~ (n |(-n))),
>
> A two register instruction
>
> cntz rd rs1 rs2 : rd = popcount(~ (rs1 | (-rs2)))
>
> Would allow pcnt, ctz and clz to be implemented as 3 macros with at most two instructions instead of 3 instructions
>
>
> pcnt rd rs1 : addi rd rs1 1; cntz rd zero rd
>
> ctz rd rs1 : cntz rd rs1 rs1
>
> clz rd rs1 : brev rd rs1; cntz rd rd rd
Interesting.
The problem is I believe pcnt is significantly more expensive than ctz
or clz and might not make the cut for generic B , while I'd think at
least one of clz/ctz would (and brev too).
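A quick C sketch (mine, not from the thread) of the proposed cntz and the macro expansions above, handy for checking the identities on a 32-bit register:

#include <stdint.h>

/* Portable popcount. */
static unsigned popcount32(uint32_t x)
{
    unsigned n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* Proposed cntz rd rs1 rs2 : popcount(~(rs1 | -rs2)). */
static unsigned cntz(uint32_t rs1, uint32_t rs2)
{
    return popcount32(~(rs1 | (0u - rs2)));   /* 0u - rs2 is -rs2 mod 2^32 */
}

/* The macro expansions from the proposal above. */
static unsigned pcnt_m(uint32_t x) { return cntz(0, x + 1); }  /* addi; cntz rd zero rd */
static unsigned ctz_m(uint32_t x)  { return cntz(x, x); }      /* cntz rd rs1 rs1       */
/* clz additionally needs brev: clz(x) == ctz_m(brev(x)). */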
> page 5 section 2.1.3. Logic with negate (andn, orn, xnor)
>
>
> Comment: I like that all basic logical operations have a “n” version.
>
> Is nand and nor not also worth it (the RVV mask instructions seem to have them for example) ?
I don't think so. Nand and nor can be done by negating the input of
the operation that uses their results. If both inputs of the dependent
AND or OR need negating, then use De Morgan.
> Proposal:
>
> I would propose the following additional logic instructions
>
>
> #and rd rs1 rs2 # rd = rs1 & rs2
>
> andn rd rs1 rs2 # rd = rs1 &~ rs2
>
> nand rd rs1 rs2 # rd = ~(rs1 & rs2)
>
> orn rd rs1 rs2 # rd = ~(rs1 &~ rs2) = ~rs1 | rs2
>
>
>
> and.ne rd rs1 rs2 # rd = (-!!rs1) & rs2 == (!!rs1) ? rs2 : 0
>
> andn.ne rd rs1 rs2 # rd = (-!!rs1) &~ rs2 == (!!rs1) ? ~rs2 : 0
>
> orn.eq rd rs1 rs2 # rd = ~((-!!rs1) & rs2) == (-!rs1) |~ rs2 == (!!rs1) ? ~rs2 : (-1)
>
> or.eq rd rs1 rs2 # rd = ~((-!!rs1) &~ rs2) == (-!rs1) | rs2 == (!!rs1) ? rs2 : (-1)
>
>
> #or rd rs1 rs2 # rd = rs1 | rs2
>
> nor rd rs1 rs2 # rd = ~(rs1 | rs2)
>
> #xor rd rs1 rs2 # rd = rs1 ^ rs2
>
> xnor rd rs1 rs2 # rd = ~(rs1 ^ rs2) == rs1 ^ rs2 ^ (-1) == rs1 ^ ~ rs2
>
>
> This would make AND a 3 bit family (i.e. a basic instruction with some “option flags” or small parameters, here nothing/~ on rs2, nothing/-!! on rs1, nothing/~ on rd) and OR and XOR a 1 bit family of instructions (nothing/~ on rd) that can be placed with the AND OR and XOR instructions themselves (i.e an instructions ANDN, NAND and AND.NE are supposed to have the same func3 values as AND being “AND”-like, but differ in a few bits in the funct7 space ).
My suspicion is this is all going well beyond the point of general
usefulness and benefit.
In fact I have my doubts that anything more than andn is justified by
actual frequency of use in general code. That's not to say we can't
evaluate the others.
> Page 29 2.7.2
>
>
> Comment:
>
> We have the same issue with the ternary operator here of course.
>
>
> If we have the AND.NE instruction we can proceed as for the mix operator and define the fusable sequences
>
>
> CMOV rd rs1 rs2 rs3. # rs1 ? rs2 : rs3 assume rd != rs1, rs3
>
> {
>
> XOR rd rs2 rs3, assume rd != rs1, rs3
>
> AND.NE rd rs1 rd
>
> XOR rd rd rs3
>
> }
Ok, I'm warming to your AND.NE. But maybe not all its friends.
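For the record, a small C check (mine) of the quoted CMOV sequence, modelling AND.NE as rd = (-!!rs1) & rs2; the xor/and/xor trick selects rs2 when rs1 is non-zero and rs3 otherwise:

#include <stdint.h>

/* Proposed AND.NE: build an all-ones mask if rs1 != 0, then AND with rs2. */
static uint64_t and_ne(uint64_t rs1, uint64_t rs2)
{
    uint64_t mask = (uint64_t)0 - (uint64_t)(rs1 != 0);   /* 0 or all-ones */
    return mask & rs2;
}

/* The quoted three-instruction expansion of CMOV rd rs1 rs2 rs3 (rs1 ? rs2 : rs3). */
static uint64_t cmov_seq(uint64_t rs1, uint64_t rs2, uint64_t rs3)
{
    uint64_t rd = rs2 ^ rs3;   /* XOR    rd, rs2, rs3 */
    rd = and_ne(rs1, rd);      /* AND.NE rd, rs1, rd  */
    rd = rd ^ rs3;             /* XOR    rd, rd, rs3  */
    return rd;                 /* rs2 if rs1 != 0, else rs3 */
}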
> Page 31 Section 2.8 Unsigned address calculation
>
>
> Comment: I like these instructions. Since adds are so dominant in usage and in particular in address calculations in most code they may well turn out the most used instructions of the B extension.
Yes. Which I think justifies the expansion of the definition of B.
Though you could sneak it in under the justification that bit
manipulations are traditionally regarded as being done on bags of bits
aka unsigned numbers, especially shifts.
> Likewise I think there is value in having pointer + integer instruction i.e. an instruction that absorbs the shift like LEA in x86 (one of the most commonly used arithmetic instructions on x86, although that is a bit misleading because it is also the 2 operand add on x86). People have been touting fusion of
>
> SLLI rd rs2 imm
> ADD rd rs1 rd
>
> but note that one cannot use the RVC version of SLLI unless rd == rs2 (the register containing the index) and for basic implementations an “already fused instruction”
>
> ADDSL<k> rd rs1 rs2 # SLLI rd rs2 <k> ; ADD rd rs1 rd
>
> actually seems easier to implement than fusion.
I like it, except:
1) isn't there a real danger that cascading a shift into an add
impacts the cycle time? Yes, some implementations could split it, but
the general RISC-V philosophy is fusion not splitting because this
puts the complexity burden on the already-complex implementations not
on the very simple implementations.
2) in most uses, which is in loops, the whole thing is usually
strength-reduced to ADDI rd, rd, 1<<k anyway.
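For concreteness, a minimal C model (mine) of what the proposed ADDSL<k> computes, i.e. the SLLI; ADD pair above collapsed into one scaled-index address calculation:

#include <stdint.h>

/* Model of the proposed ADDSL<k>: scale the index by 2^k, then add the base. */
static uint64_t addsl(uint64_t base, uint64_t index, unsigned k)
{
    return base + (index << (k & 63));
}

/* e.g. the byte address of a[i] for 8-byte elements is addsl((uint64_t)a, i, 3). */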
> Page 32, Section 2.10 Future compressed instructions
>
>
> I would take this section out, or just suggest that C.NOT, C.NEG and C.ZEXT.W and C.ZEXT.D may be worthwhile additions. More quantitative data is needed however, and this just cannot be done just for the B extension. For example I think another instruction that may be worthwhile is
>
>
>
> C.LPC rd -> AUIPC rd 0 # load program counter
I like this, although note that C.JAL with an offset of 2 serves the
same purpose and could if necessary be recognized as a special case.
It would however force saving of RA even in leaf functions that don't
otherwise require it.
I see Dhrystone was discussed today, though the Dhrystone result is hard to judge (e.g., ensuring equivalent compiler options between armcc and arm-gcc).
How is the expected score for Coremark? So far I didn’t see it in this mail thread.
Thanks.
> While rol and ror are the easiest to explain permutation, they are also the most easily done with just shifts and OR’s. The butterfly network is also not really simpler than that for GREV, i.e slightly expensive. With the below SLLNEG/SRLNEG and more importantly, immediate versions, it is just 3 instructions. MIPS doesn’t have a ROL and ROR instruction. Are ROL and ROR actually worth it?
I agree, the benefit is marginal and only in pretty specialized code,
much of which would be much better accelerated with a proper SHA or
whatever instruction.
I have a few overlapping comments about ROR/ROL.
- These are essential to cryptographic code. Without them, you cannot be
performant for the common standardised block ciphers or hash functions.
- RISC-V will/must have a rotate instruction. Clearly there is discussion
to be had about whether it lives in bitmanip or in the Crypto extension.
- I guess one can argue that cryptographic workloads are "specialized".
That said, I believe that they are important enough and frequently
executed enough to mean that they deserve special attention.
- Linux uses ChaCha20 as its CRNG implementation, which suffers very
heavily without a rotate instruction. [1,3,4]
- A shift/or based rotate is between 48 and 96 bits of instruction data to
fetch.
As opposed to (assuming no compressed version) 32-bits of
rotate. The energy efficiency impact on crypto code for having to
execute three instructions rather than one will not be good. I don't
have numbers for this, but I'd be keen to see the energy efficiency
impact of favouring shift/or over rotate.
- A fused shift/or implementation requires an extra working register.
- The Crypto Task Group have discussed sharing instructions between crypto
and bitmanip. E.g. if one wanted to implement the crypto extension and not
bitmanip, you could still used the ROR/ROL instructions as defined in
bitmanip, to save crypto defining it's own ROR/ROL
.
- I don't think one can argue against ROR/ROL by saying any of
their use cases are better catered too by a dedicated accelerator.
Such accelerators lack flexibility, and are too expensive for many cores
or SoCs where ROR/ROL would be most useful. I.e. at the smaller end of
things.
- Even the most basic ARM micro-controller has rotate functionality.
Many of the new NIST lightweight cryptography[2] candidates
provide reference code for ARM. Any attempt to port these to RISC-V
will immediately show that it's lack of a rotate instruction severely
impacts their performance. This has already been noted by our research
group and others[3].
- On a very specialised but also important note: when writing power/EM
side channel resistant code, ROR/ROL has a useful property in that
it does not change the hamming weight of its inputs. This is used in
the security arguments for some classes of ARX (Add, rotate xor)
block cipher. Having to emulate with shift/or will absolutely change
the hamming weight of the inputs. The NIST competition explicitly
asks for this sort of consideration of side-channel resilience
from the candidates.
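To make the rotate pressure in such code concrete, here is the ChaCha20 quarter round (a sketch following the published algorithm, rotate amounts 16/12/8/7); every add/xor pair is followed by a rotate, each of which becomes a shift/shift/or triple without ROR/ROL:

#include <stdint.h>

/* Rotate left by a non-zero constant. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* ChaCha20 quarter round: four add/xor/rotate triples. */
static void chacha_qr(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}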
Whilst a case can be made for doing video as a hardware accelerated block, the fact that recent ARM64 processors can do virtually any 720p modern and older algorithm with NEON only is extremely compelling.
With video decode being such a huge market the case for its investigation and inclusion *prior* to finalisation of BitManip should be really clear.
What is the process by which this may be discussed, in a way that is inclusive to all stakeholders?
L.
Also I haven't seen any mention of video encode and decode examples. I would anticipate that at the very least, BitManip would be used for efficient YUV/RGB conversion, across a range of bit widths on either side (RGB8888, RGB565 etc).
On Thursday, September 19, 2019 at 10:25:58 UTC+2, Ben Marshall wrote:
> While rol and ror are the easiest to explain permutation, they are also the most easily done with just shifts and OR’s. The butterfly network is also not really simpler than that for GREV, i.e slightly expensive. With the below SLLNEG/SRLNEG and more importantly, immediate versions, it is just 3 instructions. MIPS doesn’t have a ROL and ROR instruction. Are ROL and ROR actually worth it?
I agree, the benefit is marginal and only in pretty specialized code,
much of which would be much better accelerated with a proper SHA or
whatever instruction.
It was very much a question. Your opinion is much better informed than mine. Thanks for joining the discussion.
I have a few overlapping comments about ROR/ROL.
- These are essential to cryptographic code. Without them, you cannot be
performant for the common standardised block ciphers or hash functions.
- RISC-V will/must have a rotate instruction. Clearly there is discussion
to be had about whether it lives in bitmanip or in the Crypto extension.
- I guess one can argue that cryptographic workloads are "specialized".
That said, I believe that they are important enough and frequently
executed enough to mean that they deserve special attention.
- Linux uses ChaCha20 as its CRNG implementation, which suffers very
heavily without a rotate instruction. [1,3,4]
I cannot really judge how relevant this very focussed micro benchmark within the kernel's use of the CRNG is, but it clearly takes a (surprisingly big) hit.
- A shift/or based rotate is between 48 and 96 bits of instruction data to
fetch.
it seems hard to get 48 bit, but 64 bit is doable (with an immediate and RVC)
srli a5, a0, (-imm)
C.slli a0, a0, imm
C.or a0, a0, a5
As opposed to (assuming no compressed version) 32-bits of
rotate. The energy efficiency impact on crypto code for having to
execute three instructions rather than one will not be good. I don't
have numbers for this, but I'd be keen to see the energy efficiency
impact of favouring shift/or over rotate.
Sounds like valuable data to obtain.
- Do you see value in a ROL instruction (which is ROR with -rs2) if ROR is provided?
- Do you see value in the CLMUL and BMATXOR instructions? Does security impose special implementation considerations for these instructions?
- Do you see bitmanip-like primitives you feel would be very useful that have been missed ?
On 19/09/2019 15:43, Rogier Brussee wrote:
On Thursday, September 19, 2019 at 10:25:58 UTC+2, Ben Marshall wrote:
[snip]
As opposed to (assuming no compressed version) 32-bits of
rotate. The energy efficiency impact on crypto code for having to
execute three instructions rather than one will not be good. I don't
have numbers for this, but I'd be keen to see the energy efficiency
impact of favouring shift/or over rotate.
Sounds like valuable data to obtain.
Agreed. I think it's been missing from a lot of the macro-op fusion based
justifications around RISC-V.
- A fused shift/or implementation requires an extra working register.
Do you think a security oriented ROR/ROL imposes special implementation restrictions?
I can't see why ROR/ROL would be disproportionately expensive compared to something like a SLL/SRL.
Crypto code always does its shift/rotations by constant amounts.
- Do you see value in a ROL instruction (which is ROR with -rs2) if ROR is provided?
For the most part, that means you'd use the immediate rotate/shift instruction anyway.
The only time I've seen the need for a variable ROR/ROL is when the
rotation distance is dependent on something like a table of round
constants (like the Keccak Round function for SHA3), which get loaded
in from memory at the start of a round. Even then, you can just change
the round constant in memory to achieve the same effect. So, no, I've
not yet seen enough examples for crypto where ROL is essential in
addition to ROR.
- CMOV is useful for constant time implementations.
- I believe that CMIX will prove useful for certain types of power/EM
countermeasures, but I haven't got far enough into that yet to be
sure.
Cheers,
Ben
> Especially because Crypto code tends to be wrapped up in a library, which developers are (understandably) taught not to try and re-roll themselves.
Sometimes, the sensible ones listen! Yes, I have been on sci.crypt for a while, back around 2003 :)
The appearance of primitives in BitManip that are suitable for cryptography: I have a couple of questions.
Firstly, it seems to be quite important to have them, despite their specialist use: it's not like the linux kernel is not popular or anything, and without a good PRNG a heck of a lot of things break. Would you be happy to write up a suitable section to be added to the spec which explains this?
Secondly, along the lines of RISC principles, it is sounding like a separate crypto Extension *is* the BitManip extension, given the applicability of so many of the opcodes to hashing and symmetric ciphers. Would you concur?
L.
- I believe that CMIX will prove useful for certain types of power/EM
countermeasures, but I haven't got far enough into that yet to be
sure.
Is there a reason why using a sequence of simple bit instructions like
CMIX rd rs1 rs2 rs3 # rd = rs1&rs2 | (~rs1)&rs3 , assume rd != rs1, rs3
{
XOR rd rs2 rs3
AND rd rd rs1
XOR rd rd rs3
}
is problematic from a power/EM point of view?
[snip]
Thanks!
Rogier
In general, it is very hard to reason about power/EM properties of
a piece of cryptographic software above the ISA level.
One really needs to consider the micro-architecture of the
core (and indeed SoC) being evaluated. I temper any assertions I go on to make with that caution.
The Bitmanip spec already mentions the PRNG use case for CLMUL.
I don't know if that sort of explanation of use-cases belongs in the spec
or somewhere else,
but I agree it's a useful thing to have.
> Secondly, along the lines of RISC principles, it is sounding like a separate crypto Extension *is* the BitManip extension, given the applicability of so many of the opcodes to hashing and symmetric ciphers. Would you concur?
Almost. I disagree that the bitmanip extension serves entirely as the
crypto extension.
There is clearly a lot of overlap, and my hope is that much of the
overlapping functionality can be shared between the two.
e.g. One could implement just the crypto extension or just bitmanip
and use the same opcode for ROR/CMOV etc.
There are however other things which a crypto extension will need which
do not belong in Bitmanip. Public key cryptography support is the big one,
which boils down to long arithmetic.
Current RISC-V support for long
arithmetic suffers because of how it handles carry, and the lack of
indexed load and store.
Having a dedicated crypto extension also means you can add very specialist
instructions for algorithms like AES/SHA*, which clearly belong on
their own special set.
On Monday, September 16, 2019 at 09:46:51 UTC+2, clifford wrote:
> On Mon, Sep 9, 2019 at 8:13 PM Rogier Brussee <rogier...@gmail.com> wrote:
>> [..] For signed shorts and chars you can invoke undefined behaviour to not care about overflow (and do you want to?). However you need sign/zero extension every time you pass a signed/unsigned short/char by the calling convention.
> That's not entirely true. Because, as you have stated correctly, overflow for signed data types is undefined in C, it is safe for a compiler to not sign-extend in most cases involving signed data types. There's no need to adhere to the calling convention once undefined behavior is triggered.
I was just going to uncomfortably agree with you when I realised that I was wrong about the overflow. [..]
There is no undefined behaviour for the adding of shorts, because arithmetic on shorts first does a conversion to an int and then adds the ints; only at assignment do you convert back. I.e.
extern short a, b;
short c = a + b;
is equivalent to
short c = (short)((int)a + (int)b);
Hi,
> [..] There is no undefined behaviour for the adding of shorts, because arithmetic on shorts first does a conversion to an int and then adds the ints; only at assignment do you convert back. I.e. extern short a, b; short c = a + b; is equivalent to short c = (short)((int)a + (int)b);
Oops. Now that you mention it I remember this. You are of course correct.
There is still a signed overflow when converting the int result back to a short. In C99 and C11 we find the following language (6.3.1.3):
> Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
But because this is implementation defined and not undefined it would indeed not be okay to violate the calling convention, and the sign extension must be fixed with a slli+srai sequence.
That being said, I still don't think it is an urgent matter. A while back I disassembled the entire risc-v debian base image (debian-riscv64-tarball-20180418) and checked the (static) frequency of byte and half-word sign extension and zero extension sequences:

total   10489586
sextb       2198 (0.0210%)
sexth        595 (0.0057%)
zextb        150 (0.0014%)
zexth       4068 (0.0388%)

("total" is the total number of instructions in all binaries and shared objects included in that base image)
Moving the sext.h / sext.b discussion to https://github.com/riscv/riscv-bitmanip/issues/46.
Quickly:
There is a separate crypto extension that has had a lot of work put into it. It may share some opcodes with bitmanip (which means those ops are present if either extension is implemented). Or not.
Secondly: if you want to avoid power side channels, don't bother trying to craft specially implemented opcodes; just add a crypto coprocessor that is specially designed to do that (Cryptography Research, a division of Rambus, has demonstrated this at several events and licenses the cores).
Jacob
On Mon, Sep 23, 2019 at 10:44 AM Bruce Hoult <bruce...@sifive.com> wrote:
>
> On Mon, Sep 23, 2019 at 12:59 AM Rogier Brussee
> <rogier...@gmail.com> wrote:
> >
> > On Monday, September 23, 2019 at 03:18:45 UTC+2, lkcl wrote:
> >>
> >> On Monday, September 23, 2019 at 3:37:01 AM UTC+8, Rogier Brussee wrote:
> >>
> >>
> >> > The question here is whether assuming you have ROR, do you also need ROL which would be just
> >> >
> >> >
> >> > ROL rd rs1 rs2 ;t # assume t != rs1
> >> > {
> >> > NEG t, rs2
> >> > ROR rd, rs1, t
> >> > }
> >>
> >> Hi Rogier, Jacob's travelling at the moment so does not have easy computer access.
Yeah, sorry about the delay.
> >>
> >> I looked up http://pcg-random.org and found the algorithm on wikipedia:
> >> https://en.m.wikipedia.org/wiki/Permuted_congruential_generator
> >>
> >> I am guessing here that what Jacob is pointing out is that the inner loop on PCG is so small that removing one or other of those ops and replacing it with a 2-op sequence would have a disproportionately large impact on performance.
Yes, that's the case.
> >>
> >
> > In the link you send, only ROR is used, which is sort of the point.
Yes, I concede that only one or the other is needed, though if there's
readily available space, I think both left and right rotate should be
supported for both symmetry as well as there are likely to be
additional algorithms both now and in the future that have similar
requirements for variable rotates, but with different directions. The
additional hardware to invert the rotation direction should be quite
small (5-bit negate for RV32), so not much of an issue.
If we do decide to have only one rotate direction, I think we should
pick the direction that lets PCG use a single instruction: PCG is
intended to generate random numbers very quickly, an emulated rotate
would have a disproportionately large impact due to the short inner
loop, and PCG is becoming commonly used.
>
> Yes, I saw that.
>
> But also, by definition the rotate amount is random, so it doesn't
> actually matter whether it's left or right.
It does matter, because people often pick a particular algorithm so
they can obtain the same random sequence cross-platform (eg.
synchronizing random events in a networked game).
Jacob Lifshay
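For reference, the PCG32 generator discussed above boils down to the following (a sketch following the published minimal implementation; the seed/increment values are only illustrative), with the single variable right rotate sitting in the short hot path:

#include <stdint.h>

/* PCG32, minimal variant: one 64-bit LCG step, then an xorshift and a
   data-dependent right rotate of the top bits. */
static uint64_t pcg_state = 0x853c49e6748fea9bULL;      /* illustrative seed */
static const uint64_t pcg_inc = 0xda3e39cb94b95bdbULL;  /* must be odd */

static uint32_t pcg32(void)
{
    uint64_t old = pcg_state;
    pcg_state = old * 6364136223846793005ULL + pcg_inc;
    uint32_t xorshifted = (uint32_t)(((old >> 18) ^ old) >> 27);
    uint32_t rot = (uint32_t)(old >> 59);
    /* The variable right rotate: one ROR, or three instructions without it. */
    return (xorshifted >> rot) | (xorshifted << ((32 - rot) & 31));
}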
On Thu, Sep 26, 2019 at 2:34 PM Rogier Brussee <rogier...@gmail.com> wrote:
>
> On Thursday, September 26, 2019 at 04:06:40 UTC+2, Jacob Lifshay wrote:
>>
>> Yes, I concede that only one or the other is needed, though if there's
>> readily available space, I think both left and right rotate should be
>> supported for both symmetry as well as there are likely to be
>> additional algorithms both now and in the future that have similar
>> requirements for variable rotates, but with different directions. The
>> additional hardware to invert the rotation direction should be quite
>> small (5-bit negate for RV32), so not much of an issue.
>
>
> From a consistency point of view, I would say: either have a NEG version
> for all shift like operations (SLL<-> SLLNEG ,SLA <-> SLANEG,
> SRA <->SRANEG, ROR<->ROL), or none at all. It is not a terribly big deal
> though.
(assuming you meant SRL[NEG] not SLA[NEG])
I was looking at it from the viewpoint of wanting left-right symmetry
rather than neg symmetry, since otherwise rotate would be the only
instruction that doesn't have left-right symmetry, neg just happens to
be a way to implement left/right rotate in terms of right/left rotate.
Left-right symmetry:
SLL <-> SRL
SLL <-> SRA (would actually be SLA <-> SRA, but SLA is redundant)
ROL <-> ROR
Let me explain by comparing the immediate versions.
Q: Why does nobody propose a ROLI rd rs1 imm instruction?
A: Because it is functionally equivalent to
RORI rd rs1 (XLEN - imm)U
which, because of modular arithmetic mod XLEN, is equivalent to
RORI rd rs1 (-imm)U
Since for immediates we can just let the compiler compute that (-imm) (or XLEN - imm), a ROLI instruction is redundant.
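Spelling that out in C (a sketch, using the (-imm)U convention from the review) makes the redundancy visible:

#include <stdint.h>
#define XLEN 32
typedef uint32_t uint_xlen_t;

/* Rotate right by imm, with shift amounts taken mod XLEN as in RV. */
static uint_xlen_t rori(uint_xlen_t x, unsigned imm)
{
    return (x >> (imm % XLEN)) | (x << ((0u - imm) % XLEN));
}

/* A hypothetical roli is the same computation with the immediate negated,
   which the assembler/compiler can do at build time -- hence no ROLI. */
static uint_xlen_t roli(uint_xlen_t x, unsigned imm)
{
    return rori(x, (0u - imm) % XLEN);
}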
The register versions ROL and ROR are not redundant:
ROR rd rs1 rs2
and
ROL rd rs1 rs2
are different instructions. However, if you have one, the other just replaces a two instruction sequence. E.g. if we have ROR, then the macro/compiler generated sequence
ROL_MACRO rd rs1 rs2 ; t # assume t != rs1
{
NEG t, rs2
ROR rd, rs1, t
}
could be used instead of a ROL instruction.
Now consider a hypothetical instruction SLLNEGI rd rs1 imm with the semantics of
SLLI rd rs1 (XLEN - imm)U
(and similarly for SRLNEGI <-> SRLI and SRANEGI <-> SRAI). Again because of modular arithmetic mod XLEN, which applies because in RV shifts have an rs2/immediate that is subject to modular arithmetic mod XLEN (i.e. rs2 &= (XLEN-1)), this is equivalent to
SLLI rd rs1 (-imm)U.
Thus, a SLLNEGI instruction would be completely redundant, just like ROLI is. It is not a useless concept, however: expressions like this crop up all the time in sign extension, extraction of bitfields, the sequence needed to emulate funnel shifts, or even rotations if you don't want ROR[I] as a separate instruction.
The corresponding register versions
SLL rd rs1 rs2
and
SLLNEG rd rs1 rs2
are not equivalent, but just like ROL and ROR, are different instructions. However, just as in the case of ROL, they only replace a two instruction macro/compiler generated sequence
SLLNEG_MACRO rd rs1 rs2 ; t # assume t != rs1
{
NEG t, rs2
SLL rd, rs1, t
}
Now while a NEG version would not seem to give a huge benefit, a negation (minus) on rs2 is already there for SUB. Hence, having ROL as well as ROR is probably cheap.
But then it should be equally cheap, perhaps even cheaper, to also have a NEG version of the other shifts, i.e. SLLNEG, SRLNEG and SRANEG. The cost will be dominated by decoding, so it largely depends on encoding details, which may change in case SLO and friends get dropped. Shifts are more prevalent than rotates, so if ROL seems worth it, that suggests so are SLLNEG, SRLNEG and SRANEG, if only to make the register version of the patterns for bit field extraction shorter. But frankly, I doubt a case can be made for any of them, because the register versions of shifts/rotates are so rare.
It seems x86 has a bewildering array of ROR's and ROL's, but ARM has just a ROR. Thus, it seems likely that primitives like the PCG random number generator, which would work with both ROL and ROR, would gravitate towards picking ROR. YMMV.
Not really: the minus is generally integrated into the add ALU, implemented as rs1 + ~rs2 + 1; only the bitwise not on rs2 is separable. A separate neg ALU would usually be required, or the low bits of the add/sub ALU could be used. That said, I was assuming that most high-performance implementations would have a barrel shifter that can do both left and right rotate, maybe by doing an optional bit-reverse at the beginning and end, which would also eliminate the need for sll since srl is the bit-reversed version.
> It seems x86 has a bewildering array of ROR's and ROL's, but ARM has just a ROR. Thus, it seems likely that primitives like the PCG random number generator, which would work with both ROL and ROR, would gravitate towards picking ROR. YMMV.
I was not aware that other architectures had only 1 rotate direction. Spending a little time with rust.godbolt.org:
x86 has both rol and ror (as well as rcl and rcr, but those are very rarely used)
arm/aarch64 has ror.
mips has ror.
s390 has rol.
powerpc has rol.
webassembly has both rol and ror.
sparc has neither.
So, I think we should have a ror instruction.
Jacob
One potential solution is to use RVC register numbering or similar. 10 bits can be used to store 3 registers (3, 3, 4 bits). If encoding across all 3 register fields rd, rs1 and rs2, that is 15 bits, which could be allocated 4, 4, 4, 3.
It is not ideal however neither is taking up the entire OP32 space with Ternary operations each requiring 4 registers.
A little thought may find an overlap with existing RVC encoding.
The other alternative is a destructive ternary operation which uses one of the src operands as a dest. This is less than ideal too however, again, none of the options are.
Thoughts welcomed as to the benefits of each. Are there any other options?
L.
On 03/10/2019, lkcl <luke.l...@gmail.com> wrote:
> Thoughts welcomed as to the benefits of each. Are there any other options?
Yes, long instruction encoding
This is one reason to be wary of ternary ops for sure. It's
unfortunate the F/D extension wastes so much opcode space with FMA
while FNMADD, FNMSUB could be done with macro op fusion.
I also note CMOV can be done in terms of CMIX and SLT+NEG, the latter
of which I think may deserve its own opcode, as it seems quite useful in
bit-twiddling code [0] and could be more readily fused. But I'll leave
that to folks with more experience...
[0] For another example, min x y = y ^ ((x ^ y) & -(x < y))
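A small C illustration (mine) of the footnoted trick: SLT produces a 0/1 flag, NEG turns it into an all-zeros/all-ones mask, and the xor/and pair then selects the minimum without a branch:

#include <stdint.h>

/* Branchless min of two signed 64-bit values, per footnote [0]:
   mask = -(x < y) is all-ones when x < y, else zero. */
static int64_t min_branchless(int64_t x, int64_t y)
{
    int64_t mask = -(int64_t)(x < y);   /* SLT then NEG */
    return y ^ ((x ^ y) & mask);        /* x if x < y, else y */
}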
Oh. Yes. There's something specifically in the OpenCL and Vulkan specs about this. Some less accurate situations permit FMUL and FADD to be used, and there is an OpenCL extended opcode to cover that.
L.
> In much or most floating point code, FMA and friends are *the* most
> common FP operation.
Makes sense.
Rogier came up with some nice bitmanip macro op fusion sequences for ternary ops.
Keeping RV clean (no 4,3,3) sensible.
Need to go through all options for completeness. Miss something, kick ourselves when it's too late.
L.
> To clarify: i meant the following:
> FNMADD x, a, b = FMADD x, a, b; FNEG x, x
> FNMSUB x, a, b = FMSUB x, a, b; FNEG x, x
> which should be exactly equivalent, to my knowledge.
On the face of it, it seems reasonable; the problem is that the NaN interaction, rounding and so on produce different answers. John Hauser's work is the best reference here, in hardfloat and softfloat.
hardfloat-verilog is around 250 lines of the most stunningly densely packed and yet still readable logic I have ever seen (most verilog that dense is flat out unreadable).
The core of the issue is that the intermediary register between the MUL and the ADD loses precision: rounding is performed *twice*.
Anyway.
OT :)
Rogier knows the macro ops, will leave it to him to reply.
L.
On Fri, Oct 4, 2019 at 4:24 AM lkcl <luke.l...@gmail.com> wrote:
>
> Ternary operations are so ultra expensive, literally taking up an entire major opcode each (minus 3 bits) that the inclusion of so many of them in BitManip has me concerned.
*Nothing* is included in BitManip at this point. Some people have said
they want certain ternary ops, so they are being *evaluated* to see if
they are worth the very large cost. My answer to that is they are not,
and I believe Clifford thinks the same, at least for general purpose
processors.
Some people in embedded applications may not ever want to use that
opcode space for anything else and therefore be happy to pay that
encoding space cost and also the cost of a 3rd read port (per
execution pipe, for superscalar) on the integer register file. To
support them we should standardize on the opcodes so that the people
who want them can have assembler and compiler support.
It can even be ok if those opcodes overlap some other extension as
long as those extensions are marked as mutually incompatible.
Beside interest of its own, it can be used to emulate the general case with
CMOV1 rd rs1 rs2 rs3 # rs1 ? rs2 : rs3 == (rs1 ? (rs2 ^ rs3) : 0) ^ rs3, assume rd != rs1, rs3
{
XOR rd, rs2, rs3, assume rd != rs1, rs3
AND.NE rd, rs1, rd
XOR rd, rd, rs3
}
Again this is a fusable sequence if you are careful about the assumptions.
(With an additional instruction
AND.EQ rd rs1 rs2 # rd = (!rs1) ? rs2 : 0 == (-!rs1) & rs2 == (~ -!!rs1) & rs2
you could do something similar to MIX2, but I am not sure it is worth it. Even without that you can do the analogue of MIX3.)
Rogier, the AND.NE etc make a lot of sense as macro op fused overlays to construct ternary ops.
Two different CMOV sequences, both with different assumptions, could be used under mutually alternate circumstances.
Intuitively I can foresee that in a bitlevel fashion the same tricks using ANDN etc could be used to do ternary bit selection i.e. bit mixing from 2 sources.
A hypothetical NAND and NOR, on the other hand, are symmetrical and consequently lack the same "clout". What *can* they be used for which AND/OR followed by NOT cannot achieve? Any loops where 3 or 4 instructions is a heavy penalty?
L.
Some projects plan to base multiprocessor systems (including supercomputers) on RISC-V. I wonder what is the effect of extensions like atomic operations (as well as vector operations, etc.) on the pipeline in multiprocessor (and/or multithread) environments. The machine instruction level solution to provide atomic operations can be acceptable when concerting the operation of just two threads (and/or processors), but can be tragic in the case of a high-performance (parallelized sequential) environment.
Regards
Janos
This has been mentioned before, but Alpha had a cmov rd = cond ? rs1 : rs2, but only implemented a microarchitecture with 2 inputs. And since it was out of order, it renamed all destinations. Their implementation cracked the op into
cmovT rd = cond ? src1 : rd    # but really if (cond) { rd = src1 }
cmovF rd = cond ? rd : src2    # but really if (!cond) { rd = src2 }
so it didn't need to fetch rd at all, and special-cased renaming so it renamed rd only once.