[RFC] Bit Manipulation Instructions

771 views
Skip to first unread message

Michael Clark

unread,
Oct 28, 2017, 2:19:08 PM10/28/17
to RISC-V ISA Dev
Hi Folks,

Bit manipulation instructions have been on my mind for a while as I am interesting in creating some encodings so that we can add candidate instructions to a branch of GCC so that we can start looking at stats such as dynamic retired instructions or cycle counts. I’m not yet a member of the BitManip Group in the riscv.org workgroup system so I’m posting here to ISA dev. I haven’t seen any public updates from the BitManip group, and was expecting there would be some open discussion on ISA-dev (a la IETF).

I recently noticed Clifford Wolf’s [1] analysis of BEXT and BDEP cores (via Twitter) so i’ve adopted consistent naming for Parallel Extract and Deposit. I’ve used Bit as the prefix for all instructions except for Rotate. Rotate can likely fit into opcode space around the existing shift instructions and is trivial from an implementation and area perspective if one already has a barrel shifter. I don’t have GREV for Generalised Bit Reversal as currently my list contains instructions that map to compiler builtins or compiler lifting i.e. have widespread adoption in many different commodity processors. Every instruction with the exception of BEXT/BDEP either has a compiler builtin or lift and is in widespread use.

One of the issues is whether RV64 has 32-bit W variants of the bit manipulation instructions. This obviously does not affect RV32. My argument for W variants is that RV64 cores can pay the price for a constant offset shifter for BREV (bit reverse) and masking for BCNT (popcount). Shifts already have 32-bit variants and 32-bit rotate variants are necessary for a large number of crypto ciphers. The argument for BSWAPW on RV64 is one of ISA consistency and also its frequency of use. The macro-op fusion argument or 32-bit byte swapping (BSWAP plus SRL) punishes simpler RV64 implementations that can’t afford macro-op fusion.

I’ve also written a little about a possible area efficient implementation of BCLZ and BCTZ (Count Leading and Trailing Zeros) [2],[3]. I have made a high level diagram of an approach that saves area at the expense of critical path length. Below I look at the size of the instruction sequences for BCLZ and BCTZ constructed in terms of BREV and BCNT. They are certainly at the threshold of being worthwhile to implement in hardware. Adding shifts and masks for 32-bit versions will make the construction in terms of other primitives even longer. If there is a way to implement them in a small area then I think they are worthwhile instructions as they are available on most commodity processors. ARM and x86 both have count leading and trailing zeros (x86 has BSF/BSR and LZCNT, AMD also has TZCNT although these duplicate BSF/BSR). From searching for the builtin_ctz, builtin_ctzll, builtin_clz and builtin_clzll, in code search engines, they appear to be very frequently used.

There is nothing very exotic in the proposal below. As I mentioned,. all of the instructions are presently lifted by GCC and Clang (rotate) or have builtins, with the exception of BEXT/BDEP. I’ve read the papers about BEXT/BDEP, in particular benchmarks compressing nucleotide sequences for BLAST. and have contacted a BLAST user at LBL however from looking at the BLAST codes, it appears that the accelerated versions in the papers have not been mainlined. i.e. it is hard to find code that uses PEXT/PDEP. I’ve included them as they seem to have some potential. I need to learn about how GREV can be used. I only heard about GREV via Clifford’s mailing list post.

It would be nice if we could have some updates from the Bit Manip group on the mailing list (a la IETF). IETF and CFRG deal with patent issues but that doesn’t seem to be a barrier to open discussion on the lists. The patented techniques tend to get surfaced in the open and are excluded from being used in any IETF RFCs.

As I mentioned, there is nothing really exotic in this proposal besides BEXT/BDEP, and the count leading and trailing zeros can be constructed in an area efficient manner, so the next step might be to assign some opcode space and map these to builtins in GCC to allow us to perform some simulation… In fact count leading zeros is one of the most commonly instructions in my code search (it is used in big integer and soft float codes for re-normalisation as well as having many other uses). The open coded count leading zeros in terms of BREV and BCNT along with the set trailing zeros to ones ((x & -x) - 1) is large enough that this pipeline could be contructed in hardware re-using the BREV and BCNT functional units.

Michael.

[1] https://github.com/cliffordwolf/bextdep
[2] https://github.com/rv8-io/rv8/blob/master/doc/src/bmi.md
[3] https://github.com/rv8-io/rv8/blob/master/doc/src/bitmanip.png


## Candidate instructions for B extension (Bit Manipulation)

Opcode Long Name Description
RLL[W] rd,rs1,rs2 Rotate Left Logical Rotate bits in rs1 left by the amount in rs2
RRL[W] rd,rs1,rs2 Rotate Right Logical Rotate bits in rs1 right by the amount in rs2
RLLI[W] rd,rs1,shamt Rotate Left Logical Immediate Rotate bits in rs1 left by the immediate
RRLI[W] rd,rs1,shamt Rotate Right Logical Immediate Rotate bits in rs1 right by the immediate
BCLZ[W] rd,rs1 Bit Count Leading Zeros Count leading zero bits in rs1
BCTZ[W] rd,rs1 Bit Count Trailing Zeros Count trailing zero bits in rs1
BCNT[W] rd,rs1 Bit Count Count number of bits set in rs1
BREV[W] rd,rs1 Bit Reverse Reverse bits in rs1
BSWAP[W] rd,rs1 Byte Swap Swap byte order in rs1
BEXT[W] rd,rs1,rs2 Bit Extract Gather LSB justified bits to rd from rs1 using extract mask in rs2
BDEP[W] rd,rs1,rs2 Bit Deposit Scatter LSB justified bits from rs2 to rd using deposit mask in rs2


## Rotate

RLL[W] rd,rs1,rs2 Rotate Left Logical Rotate bits in rs1 left by the amount in rs2
RRL[W] rd,rs1,rs2 Rotate Right Logical Rotate bits in rs1 right by the amount in rs2
RLLI[W] rd,rs1,shamt Rotate Left Logical Immediate Rotate bits in rs1 left by the immediate
RRLI[W] rd,rs1,shamt Rotate Right Logical Immediate Rotate bits in rs1 right by the immediate

Rotate instructions can be implemented very easily with the addition of carry-out carry-in logic to an existing shifter. Rotates are very simple instructions but they are frequently used in cryptographic ciphers so the small saving in cycles (3:1) is likely worth the additional area for processors that implement the B extension.


## Byte swap

BSWAP[W] rd,rs1 Byte Swap Swap byte order in rs1

Byte swapping instructions are essential for networking code and cryptographic ciphers which typically use big endian formats. A 32-bit byte swap takes ~14 instructions and a 64-bit byte swap takes ~30 instructions.

bswap_32:
slliw a3,a0,24
srliw a5,a0,24
slliw a4,a0,8
or a5,a5,a3
li a3,16711680
and a4,a4,a3
or a5,a5,a4
li a4,65536
addi a4,a4,-256
srliw a0,a0,8
and a0,a0,a4
or a0,a5,a0
sext.w a0,a0
ret

bswap_64:
slli a2,a0,56
srli a5,a0,56
or a5,a5,a2
li a2,65536
srli a3,a0,40
addi a2,a2,-256
and a3,a3,a2
li a4,255
srli a2,a0,24
or a5,a5,a3
li a3,16711680
and a2,a2,a3
slli a1,a4,24
srli a3,a0,8
and a3,a3,a1
or a5,a5,a2
slli a1,a4,32
slli a2,a0,8
and a2,a2,a1
or a5,a5,a3
slli a1,a4,40
slli a3,a0,24
or a5,a5,a2
and a3,a3,a1
slli a4,a4,48
slli a0,a0,40
or a5,a5,a3
and a0,a0,a4
or a0,a5,a0
ret


## Count Leading and Trailing Zeros

BCLZ[W] rd,rs1 Bit Count Leading Zeros Count leading zero bits in rs1
BCTZ[W] rd,rs1 Bit Count Trailing Zeros Count trailing zero bits in rs1
BCNT[W] rd,rs1 Bit Count Count number of bits set in rs1
BREV[W] rd,rs1 Bit Reverse Reverse bits in rs1

Count leading and trailing zeros can be constructed using popcount (BCNT) and bit reverse (BREV) instructions or functional units. Given the length of the instruction sequences and the possibility to multiplex via shared bit reverse and popcount functional units, the count leading and trailing zeros instructions are included.

One area saving implementation approach involves multiplexing functional units in a pipeline containing bit reverse, neg+and+sub (leading/traling) and popcount. The critical path the length for these bit manipulation instructions would be the fanout of the bit reverse circuit, the neg+and+sub (set trailing zeros to ones) circuit for (BCLZ) and (BCTZ) and the popcount circuit. The popcount instruction (BCNT) would have an input bypass and bit reverse instruction (BREV) would have an output bypass.

- https://github.com/rv8-io/rv8/blob/master/doc/src/bitmanip.png

Figure 1: Area efficient implementation approach for count leading and trailing zeros

Count leading and trailing zeros could be constructed in terms of bit reverse and popcount instructions however the area saving is a few muxes and the set trailing zeros to ones circuit (x & -x) - 1). The major area contributors with either approach are the bit reverse and popcount functional units.

The following shows C and assembler for count leading and trailing zeros constructed in terms of (BREV) bit reverse and (BCNT) popcount.

C

#include <stdint.h>

uint32_t bctz_w(uint32_t x) { return __builtin_popcount((x & -x) - 1); }
uint64_t bctz_d(uint64_t x) { return __builtin_popcountll((x & -x) - 1); }
uint32_t bclz_w(uint32_t x) { uint32_t y = __builtin_bitreverse32(x); return __builtin_popcount((y & -y) - 1); }
uint64_t bclz_d(uint64_t x) { uint64_t y = __builtin_bitreverse64(x); return __builtin_popcountll((y & -y) - 1); }

Asssembler

# count trailing zeros
.macro BCTZ.X rd, rs
neg \rd, \rs
and \rd, \rd, \rs
addi \rd, \rd, -1
bcnt \rd, \rd
.endm

# count leading zeros
.macro BCLZ.X rd, rs
brev t0, \rs
neg \rd, t0
and \rd, \rd, t0
addi \rd, \rd, -1
bcnt \rd, \rd
.endm


## Parallel Bit Extract and Deposit

BEXT[W] rd,rs1,rs2 Bit Extract Gather LSB justified bits to rd from rs1 using extract mask in rs2
BDEP[W] rd,rs1,rs2 Bit Deposit Scatter LSB justified bits from rs2 to rd using deposit mask in rs2

Bit Extract

Gather LSB justified bits to rd from rs1 using extract mask in rs2. Bits are extracted from the source register using bit positions enabled in the mask register and are placed in the result as popcount(mask) LSB contiguous bits.

Register Value
source 0b11110100
mask 0b01100011
result 0b00001100

Bit Deposit

Scatter LSB justified bits from rs2 to rd using deposit mask in rs2. Bits are deposited from the source register using popcount(mask) LSB contiguous bits and are placed in the result at the bit positions enabled in the mask register.

Register Value
source 0b11110100
mask 0b01100011
result 0b00100000

Jacob Bachmeyer

unread,
Oct 29, 2017, 12:02:02 AM10/29/17
to Michael Clark, RISC-V ISA Dev
Michael Clark wrote:
> Hi Folks,
>
> Bit manipulation instructions have been on my mind for a while as I am interesting in creating some encodings so that we can add candidate instructions to a branch of GCC so that we can start looking at stats such as dynamic retired instructions or cycle counts. I’m not yet a member of the BitManip Group in the riscv.org workgroup system so I’m posting here to ISA dev. I haven’t seen any public updates from the BitManip group, and was expecting there would be some open discussion on ISA-dev (a la IETF).
>

There were some discussions on ISA-dev before BitManip kicked off, but I
share your disappointment. I was hoping that RISC-V development would
follow the IETF model.

> I recently noticed Clifford Wolf’s [1] analysis of BEXT and BDEP cores (via Twitter) so i’ve adopted consistent naming for Parallel Extract and Deposit. I’ve used Bit as the prefix for all instructions except for Rotate. Rotate can likely fit into opcode space around the existing shift instructions and is trivial from an implementation and area perspective if one already has a barrel shifter. I don’t have GREV for Generalised Bit Reversal as currently my list contains instructions that map to compiler builtins or compiler lifting i.e. have widespread adoption in many different commodity processors. Every instruction with the exception of BEXT/BDEP either has a compiler builtin or lift and is in widespread use.
>
> One of the issues is whether RV64 has 32-bit W variants of the bit manipulation instructions. This obviously does not affect RV32. My argument for W variants is that RV64 cores can pay the price for a constant offset shifter for BREV (bit reverse) and masking for BCNT (popcount). Shifts already have 32-bit variants and 32-bit rotate variants are necessary for a large number of crypto ciphers. The argument for BSWAPW on RV64 is one of ISA consistency and also its frequency of use. The macro-op fusion argument or 32-bit byte swapping (BSWAP plus SRL) punishes simpler RV64 implementations that can’t afford macro-op fusion.
>

I previously (message-id <58968596...@gmail.com>
<URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/58968596.7060903%40gmail.com>)
suggested using bit 29 to encode "shift-mask" instructions and
(message-id <5897D8D4...@gmail.com>
<URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/5897D8D4.4040007%40gmail.com>)
the combination of bit 30 and bit 29 to encode "rotate" packed around
the existing shift instructions. If this approach is taken, then the
same encoding applies easily to the 32-bit S[LR]*W variants and the
future 64-bit S[LR]*D variants RV128 is expected to introduce.

The shift-mask instructions I previously proposed reduce "BCNTW" to
"SLMI t0, zero, 32 / AND t0, src, t0 / BCNT dest, t0". Without
shift-mask, BCNTW requires "LUI t0, 0x80000 / SLLI t0, t0, 1 / NOT t0,
t0 / AND t0, src, t0 / BCNT dest, t0". Since shift-mask has trivial
implementation costs (one additional wire from decode to ALU, replacing
the current hardwired "shift-in-zero", simplify instruction validation
logic to ignore bit 29 when decoding a shift or rotate), I believe that
shift-mask should be included in RVB and BCNTW is unneeded. Note that
BCNTW is three instructions with shift-mask and five instructions
without shift-mask.

BREV, BREVW, BSWAP, and BSWAPW can be trivially implemented in terms of
GREV, so do not need explicit opcodes of their own. One opcode (GREV)
gives us four wanted functions, so is a better choice.

> I’ve also written a little about a possible area efficient implementation of BCLZ and BCTZ (Count Leading and Trailing Zeros) [2],[3]. I have made a high level diagram of an approach that saves area at the expense of critical path length.

Critical path length will be more important than area in
high-performance systems, especially now that systems are limited by
substrate power dissipation capacity. Area, especially for circuits
that are not active on every cycle, is becoming a sunk cost, while
critical path length still limits operating frequency.

> Below I look at the size of the instruction sequences for BCLZ and BCTZ constructed in terms of BREV and BCNT. They are certainly at the threshold of being worthwhile to implement in hardware. Adding shifts and masks for 32-bit versions will make the construction in terms of other primitives even longer. If there is a way to implement them in a small area then I think they are worthwhile instructions as they are available on most commodity processors. ARM and x86 both have count leading and trailing zeros (x86 has BSF/BSR and LZCNT, AMD also has TZCNT although these duplicate BSF/BSR). From searching for the builtin_ctz, builtin_ctzll, builtin_clz and builtin_clzll, in code search engines, they appear to be very frequently used.
>

LZCNT/TZCNT do not duplicate BSF/BSR; IIRC the latter are microcoded
while the former are hardwired. No, I do not know why.

That said, since CTZ and CLZ are commonly-used operations and have no
meaningful immediate forms (thus not needing space in the very limited
gaps remaining in OP-IMM, but only needing function codes in OP) I agree
that they are useful, unless we have a more generic operation that can
easily construct them. As Bruce Hoult explained in message-id
<CAMU+Ekza-2mwoy7p2qxCFpmtho1-uaE8xVkUvw0295yPXrB=N...@mail.gmail.com>
<URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAMU+Ekza-2mwoy7p2qxCFpmtho1-uaE8xVkUvw0295yPXrB=Nw%40mail.gmail.com>,
BCTZ can be constructed from BCNT (there called POPC) using three
instructions. BCLZ is thus a four instruction sequence. Are these used
frequently enough to justify assigning them hardware opcodes?

> There is nothing very exotic in the proposal below. As I mentioned,. all of the instructions are presently lifted by GCC and Clang (rotate) or have builtins, with the exception of BEXT/BDEP. I’ve read the papers about BEXT/BDEP, in particular benchmarks compressing nucleotide sequences for BLAST. and have contacted a BLAST user at LBL however from looking at the BLAST codes, it appears that the accelerated versions in the papers have not been mainlined. i.e. it is hard to find code that uses PEXT/PDEP. I’ve included them as they seem to have some potential. I need to learn about how GREV can be used. I only heard about GREV via Clifford’s mailing list post.
>

I still oppose BEXT/BDEP/PEXT/PDEP for two main reasons: the enormous
hardware complexity and the possible patent risks.

> It would be nice if we could have some updates from the Bit Manip group on the mailing list (a la IETF). IETF and CFRG deal with patent issues but that doesn’t seem to be a barrier to open discussion on the lists. The patented techniques tend to get surfaced in the open and are excluded from being used in any IETF RFCs.
>

This is a large advantage for an open standard, and the largest concern
I have about side groups behind closed doors is the risk of perverse
incentives on this front. I understand that many standards committees
end up producing inferior results because the participants are often
more interested in shoehorning as many patented details as they can into
the specification to maximize their own royalties from implementors,
than they are interested in producing an optimal result. I hope that
the RISC-V Foundation has a categorical prohibition on the use of
patented techniques in RISC-V standards.

> As I mentioned, there is nothing really exotic in this proposal besides BEXT/BDEP, and the count leading and trailing zeros can be constructed in an area efficient manner, so the next step might be to assign some opcode space and map these to builtins in GCC to allow us to perform some simulation… In fact count leading zeros is one of the most commonly instructions in my code search (it is used in big integer and soft float codes for re-normalisation as well as having many other uses). The open coded count leading zeros in terms of BREV and BCNT along with the set trailing zeros to ones ((x & -x) - 1) is large enough that this pipeline could be contructed in hardware re-using the BREV and BCNT functional units.
>

If CLZ is the most heavily used, and the longest sequence of its group
when constructed from lower-level primitives, assigning a hardware
opcode to it sounds like a good idea. How does its usage compare with CTZ?
[...] [descriptions removed to shorten lines]

> ## Rotate
>
> RLL[W] rd,rs1,rs2 Rotate Left Logical
> RRL[W] rd,rs1,rs2 Rotate Right Logical
> RLLI[W] rd,rs1,shamt Rotate Left Logical Immediate
> RRLI[W] rd,rs1,shamt Rotate Right Logical Immediate
>
> Rotate instructions can be implemented very easily with the addition of carry-out carry-in logic to an existing shifter. Rotates are very simple instructions but they are frequently used in cryptographic ciphers so the small saving in cycles (3:1) is likely worth the additional area for processors that implement the B extension.
>

I also propose "shift-mask" instructions:

SLM[W] rd, rs1, rs2 Shift Left Mask
SRM[W] rd, rs1, rs2 Shift Right Mask
SLMI[W] rd, rs1, shamt Shift Left Mask Immediate
SRMI[W] rd, rs1, shamt Shift Right Mask Immediate

These are similar to the base ISA SLL/SRL/SLLI/SRLI, except that they
shift in ones instead of zeros. Implementation using bit 29 to select
shift-mask requires as little as a
single wire from decode to ALU replacing the base ISA's hardwired
shift-in-zeros.

Combining the two, setting both bit 30 and bit 29 in a shift instruction
selects rotate instead of shift.

> ## Byte swap
>
> BSWAP[W] rd,rs1 Byte Swap
>

GREV handles this: BSWAPD == "GREV 56"/"GREV 32+16+8", BSWAPW == "GREV
24"/"GREV 16+8"

GREVI rd, rs1, k General Reverse Immediate

See Clifford Wolf's message-id <2017102812...@clifford.at>
<URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/20171028120020.GD2798%40clifford.at>
for an explanation of GREV.

> ## Count Leading and Trailing Zeros
>
> BCLZ[W] rd,rs1 Bit Count Leading Zeros
> BCTZ[W] rd,rs1 Bit Count Trailing Zeros
> BCNT[W] rd,rs1 Bit Count
> BREV[W] rd,rs1 Bit Reverse
>
> Count leading and trailing zeros can be constructed using popcount (BCNT) and bit reverse (BREV) instructions or functional units. Given the length of the instruction sequences and the possibility to multiplex via shared bit reverse and popcount functional units, the count leading and trailing zeros instructions are included.
>
> One area saving implementation approach involves multiplexing functional units in a pipeline containing bit reverse, neg+and+sub (leading/traling) and popcount. The critical path the length for these bit manipulation instructions would be the fanout of the bit reverse circuit, the neg+and+sub (set trailing zeros to ones) circuit for (BCLZ) and (BCTZ) and the popcount circuit. The popcount instruction (BCNT) would have an input bypass and bit reverse instruction (BREV) would have an output bypass.
>

These are good, but I see no reason to prefix these mnemonics with "B";
I suggest CLZ/CTZ/POPCNT for the first three, and dropping BREV
entirely, since GREV fulfills that role and more.

[...]

> ## Parallel Bit Extract and Deposit
>
> BEXT[W] rd,rs1,rs2 Bit Extract
> BDEP[W] rd,rs1,rs2 Bit Deposit
>
> [...]
>

I still believe that these two are trouble. Between very limited use
(and coprocessors may be more appropriate for those uses), the enormous
hardware complexity (each bit is shifted an amount that depends on the
number of less-significant bits set in the control word), and the
possible patent risks (these appear to have been introduced in the x86
ISA, which is a known patent thicket possibly with
difficult-to-recognize "submarine patents" included; find a 20-year-old
system with something similar), RVB is probably better off without
BEXT/BDEP/PEXT/PDEP.


-- Jacob

Michael Clark

unread,
Oct 29, 2017, 12:30:00 AM10/29/17
to jcb6...@gmail.com, RISC-V ISA Dev


> On 29/10/2017, at 5:01 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Critical path length will be more important than area in high-performance systems, especially now that systems are limited by substrate power dissipation capacity. Area, especially for circuits that are not active on every cycle, is becoming a sunk cost, while critical path length still limits operating frequency.

Having separate functional units is a viable implementation approach for high performance processors. That was not the point of my diagram which was to satisfy those who are area constrained with one possible implementation approach.

Separate opcodes in indeed required for the most high performance implementations if we want an ISA that doesn’t presuppose or constrain implementation choices.

Having separate opcodes for CLZ/CTZ doesn’t dictate one implementation approach or the other. The point was more that a small area approach is viable while having distinct opcodes.

One ideally wants to optimise for frequency. CLZ/CTZ/BSWAP will likely be used more frequently in typical code compared to GREVI unless we have a design that exposes or dictates one particular implementation approach e.g. CLZ32 on RV64 implemented in terms of SRL; GREVI 31; NEG; AND; ADDI -1; BCNT. IMHO that would be an ISA design mistake. It’s too complex to be macro-fused.

Jacob Bachmeyer

unread,
Oct 29, 2017, 1:30:27 AM10/29/17
to Michael Clark, RISC-V ISA Dev
Michael Clark wrote:
>> On 29/10/2017, at 5:01 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>> Critical path length will be more important than area in high-performance systems, especially now that systems are limited by substrate power dissipation capacity. Area, especially for circuits that are not active on every cycle, is becoming a sunk cost, while critical path length still limits operating frequency.
>>
>
> Having separate functional units is a viable implementation approach for high performance processors. That was not the point of my diagram which was to satisfy those who are area constrained with one possible implementation approach.
>

It is a possible approach, but the trade-off is not always that simple
to actually evaluate.

> Separate opcodes in indeed required for the most high performance implementations if we want an ISA that doesn�ft presuppose or constrain implementation choices.
>
> Having separate opcodes for CLZ/CTZ doesn�ft dictate one implementation approach or the other. The point was more that a small area approach is viable while having distinct opcodes.
>

I never said that CLZ/CTZ should not be their own opcodes -- having only
one source operand means that they have no meaningful immediate form and
therefore will only take up space in the much less crowded
OP/OP-32/OP-64 major opcodes rather than needing space in OP-IMM which
is almost full. GREVI happens to have a small immediate that can fit
alongside an immediate shift instruction.

> One ideally wants to optimise for frequency. CLZ/CTZ/BSWAP will likely be used more frequently in typical code compared to GREVI unless we have a design that exposes or dictates one particular implementation approach e.g. CLZ32 on RV64 implemented in terms of SRL; GREVI 31; NEG; AND; ADDI -1; BCNT. IMHO that would be an ISA design mistake. It�fs too complex to be macro-fused.
>

The advantage for GREVI is that it provides one instruction the can
perform both bit reverse and byte swap.

To be clear, unless hard minimalism is a goal for RVB, I support
distinct CLZ/CTZ opcodes. I would prefer RVB to include, in descending
order of importance: GREVI (which provides BREV, BSWAP, and more),
POPCNT, rotate, CLZ, CTZ, and mask-shift instructions, for a total of 12
new RV32 instructions. The rotate and mask-shift instructions would be
packed amongst the baseline shift opcodes, using the existing bit 30 to
select arithmetic shift (as the baseline ISA does), bit 29 to select
mask shift and the combination of bit 30 and bit 29 to select rotate.
GREVI could be encoded as SLLI with bit 30 set, since "left arithmetic
shift" makes no sense. This preserves bits 27 and 28 for future
extensions and reserves space for GREV near SLL if that is needed in the
future. POPCNT, CLZ, and CTZ all lack meaningful immediate forms, so can
be encoded in OP/OP-32/OP-64 easily with new funct7 codes, or since
these instructions have only a single source operand, one new funct7
code and using the rs2 field to select an opcode.

AND-NOT and OR-NOT have been suggested, and could be easily encoded near
the current AND and OR, but would not have immediate forms at all due to
lack of encoding space. This would break an elegant symmetry in RISC-V.
I omit them for this reason.

Narrower forms of the rotate and mask-shift instructions are assumed to
follow the pattern of the existing shift instructions adding 8
instructions each to RV64IB and RV128IB. Encoding GREV near shift, as I
advocate, may make narrower GREV easier to implement as well, but GREV
can easily be used with care to process narrower data. Maintaining sign
extension of narrower values would require narrower forms of all of
these to be implemented.

This leaves POPCNT, CLZ, and CTZ. CTZ is inherently correct for narrower
data, so not a problem provided that a result greater than the narrower
word size is properly handled by software. Narrow CLZ can be handled
with native GREVI and native CTZ, with the same constraint on handling
the result. Narrow POPCNT requires masking the input, which can be done
with an "SLMI / AND / POPCNT" sequence. The big question then is how
frequently is narrow POPCNT actually used?


If hard minimalism is a goal, the very smallest RVB is POPCNT and GREVI.
If this is to be the case, I suggest encoding GREVI as a "left
arithmetic shift" (left-shift opcode with bit 30 set) and POPCNT
somewhere in the OP space; SLT and SLTU seem like convenient places to
assign a new funct7 code. This would add only 2 instructions, period,
but omits several operations that I expect to be very useful. I prefer
preserving the space near AND and OR for future AND-NOT and OR-NOT
instructions, should omitting them prove unwise.


-- Jacob

Michael Clark

unread,
Oct 29, 2017, 1:37:33 AM10/29/17
to jcb6...@gmail.com, RISC-V ISA Dev


> On 29/10/2017, at 5:01 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Michael Clark wrote:
>> Hi Folks,
>>
>> Bit manipulation instructions have been on my mind for a while as I am interesting in creating some encodings so that we can add candidate instructions to a branch of GCC so that we can start looking at stats such as dynamic retired instructions or cycle counts. I’m not yet a member of the BitManip Group in the riscv.org workgroup system so I’m posting here to ISA dev. I haven’t seen any public updates from the BitManip group, and was expecting there would be some open discussion on ISA-dev (a la IETF).
>>
>
> There were some discussions on ISA-dev before BitManip kicked off, but I share your disappointment. I was hoping that RISC-V development would follow the IETF model.

There was a comment at the CARRV workshop that we need a RISC-V Weekly News much like the excellent example set by LLVM Weekly News. However the LLVM mailing lists are a bit of a firehose than the RISC-V mailing lists. That’s not to say there is not a lot going on in RISC-V land, it’s just we don’t necessarily find out about it all on the mailing lists.

The issue is we need status updates shared more frequently with the mailing lists before something like a RISC-V Weekly News would be possible, and of course folk need to be able to share freely without academic pre-publication rules or proprietary industry NDAs hindering open discussion and/or updates.

In any case, I think i’m best off to just find some opcode space and making a compiler patch and take it from there, rather than try to reach some sort of consensus before proceeding.

These 9 will have easy lowering for a lot of existing code.

RLL[W] rd,rs1,rs2 Rotate Left Logical Rotate bits in rs1 left by the amount in rs2
RRL[W] rd,rs1,rs2 Rotate Right Logical Rotate bits in rs1 right by the amount in rs2
RLLI[W] rd,rs1,shamt Rotate Left Logical Immediate Rotate bits in rs1 left by the immediate
RRLI[W] rd,rs1,shamt Rotate Right Logical Immediate Rotate bits in rs1 right by the immediate
BCLZ[W] rd,rs1 Bit Count Leading Zeros Count leading zero bits in rs1
BCTZ[W] rd,rs1 Bit Count Trailing Zeros Count trailing zero bits in rs1
BCNT[W] rd,rs1 Bit Count Count number of bits set in rs1
BSWAP[W] rd,rs1 Byte Swap Swap byte order in rs1

These 3 are interesting but I don’t have any benchmarks that can use them immediately without some porting effort.

BEXT[W] rd,rs1,rs2 Bit Extract Gather LSB justified bits to rd from rs1 using extract mask in rs2
BDEP[W] rd,rs1,rs2 Bit Deposit Scatter LSB justified bits from rs2 to rd using deposit mask in rs2
GREVI[W] rd,rs1,imm7 Generalized Reverse Generalized Reverse of bits in rs1 accoring to mask in imm7

I’m sure they would be useful for BLAST(X/Y/Z) which likely occupies quite a bit compute time on some of the largest supercomputers in the world, but they will require application specific assembly to be fully exploited.

Jacob Bachmeyer

unread,
Oct 29, 2017, 1:58:37 AM10/29/17
to Michael Clark, RISC-V ISA Dev
Michael Clark wrote:
>> On 29/10/2017, at 5:01 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>> Michael Clark wrote:
>>
>>> Hi Folks,
>>>
>>> Bit manipulation instructions have been on my mind for a while as I am interesting in creating some encodings so that we can add candidate instructions to a branch of GCC so that we can start looking at stats such as dynamic retired instructions or cycle counts. I’m not yet a member of the BitManip Group in the riscv.org workgroup system so I’m posting here to ISA dev. I haven’t seen any public updates from the BitManip group, and was expecting there would be some open discussion on ISA-dev (a la IETF).
>>>
>>>
>> There were some discussions on ISA-dev before BitManip kicked off, but I share your disappointment. I was hoping that RISC-V development would follow the IETF model.
>>
>
> There was a comment at the CARRV workshop that we need a RISC-V Weekly News much like the excellent example set by LLVM Weekly News. However the LLVM mailing lists are a bit of a firehose than the RISC-V mailing lists. That’s not to say there is not a lot going on in RISC-V land, it’s just we don’t necessarily find out about it all on the mailing lists.
>
> The issue is we need status updates shared more frequently with the mailing lists before something like a RISC-V Weekly News would be possible, and of course folk need to be able to share freely without academic pre-publication rules or proprietary industry NDAs hindering open discussion and/or updates.
>
> In any case, I think i’m best off to just find some opcode space and making a compiler patch and take it from there, rather than try to reach some sort of consensus before proceeding.
>
> These 9 will have easy lowering for a lot of existing code.
>
> RLL[W] rd,rs1,rs2 Rotate Left Logical Rotate bits in rs1 left by the amount in rs2
> RRL[W] rd,rs1,rs2 Rotate Right Logical Rotate bits in rs1 right by the amount in rs2
> RLLI[W] rd,rs1,shamt Rotate Left Logical Immediate Rotate bits in rs1 left by the immediate
> RRLI[W] rd,rs1,shamt Rotate Right Logical Immediate Rotate bits in rs1 right by the immediate
>

I have been proposing an encoding for these since February: like shift,
but with bits 30 and 29 both set.

> BCLZ[W] rd,rs1 Bit Count Leading Zeros Count leading zero bits in rs1
> BCTZ[W] rd,rs1 Bit Count Trailing Zeros Count trailing zero bits in rs1
> BCNT[W] rd,rs1 Bit Count Count number of bits set in rs1
>

These can go near SLT, but with bit 30 set and a sub-function code
(POPCNT/CTZ/CLZ) in the rs2 field. (Rationale for choosing SLT:
SLT/SLTU are the only ALU operations that generate a value inside the
ALU rather than deriving it bitwise from the inputs; POPCNT/CTZ/CLZ also
generate a value inside the ALU with bits not directly traceable to
operand bits.)

> BSWAP[W] rd,rs1 Byte Swap Swap byte order in rs1
>

I suggest making this a special case of GREVI, and encoding GREVI like
SLLI, but with bit 30 set. The shamt field becomes GREVI's "k"
parameter. As such, BSWAP on RV64 is {6'b010000, 6'd56, rs1, 3'b001,
rd, 7'b0010011}, i.e. "GREVI 56". BSWAPW is either "GREVI 24" or
"GREVIW 24" depending on whether sign extension is to be preserved.
GREVIW is to SLLIW as GREVI is to SLLI.


-- Jacob

Michael Clark

unread,
Oct 29, 2017, 2:00:00 AM10/29/17
to jcb6...@gmail.com, RISC-V ISA Dev


> On 29/10/2017, at 6:30 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Michael Clark wrote:
>>> On 29/10/2017, at 5:01 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>>
>>> Critical path length will be more important than area in high-performance systems, especially now that systems are limited by substrate power dissipation capacity. Area, especially for circuits that are not active on every cycle, is becoming a sunk cost, while critical path length still limits operating frequency.
>>>
>>
>> Having separate functional units is a viable implementation approach for high performance processors. That was not the point of my diagram which was to satisfy those who are area constrained with one possible implementation approach.
>>
>
> It is a possible approach, but the trade-off is not always that simple to actually evaluate.
>
>> Separate opcodes in indeed required for the most high performance implementations if we want an ISA that doesn�ft presuppose or constrain implementation choices.
>>
>> Having separate opcodes for CLZ/CTZ doesn�ft dictate one implementation approach or the other. The point was more that a small area approach is viable while having distinct opcodes.
>>
>
> I never said that CLZ/CTZ should not be their own opcodes -- having only one source operand means that they have no meaningful immediate form and therefore will only take up space in the much less crowded OP/OP-32/OP-64 major opcodes rather than needing space in OP-IMM which is almost full. GREVI happens to have a small immediate that can fit alongside an immediate shift instruction.
>
>> One ideally wants to optimise for frequency. CLZ/CTZ/BSWAP will likely be used more frequently in typical code compared to GREVI unless we have a design that exposes or dictates one particular implementation approach e.g. CLZ32 on RV64 implemented in terms of SRL; GREVI 31; NEG; AND; ADDI -1; BCNT. IMHO that would be an ISA design mistake. It�fs too complex to be macro-fused.
>>
>
> The advantage for GREVI is that it provides one instruction the can perform both bit reverse and byte swap.

I added GREVI to the list I am maintaining.

> To be clear, unless hard minimalism is a goal for RVB, I support distinct CLZ/CTZ opcodes. I would prefer RVB to include, in descending order of importance: GREVI (which provides BREV, BSWAP, and more), POPCNT, rotate, CLZ, CTZ, and mask-shift instructions, for a total of 12 new RV32 instructions. The rotate and mask-shift instructions would be packed amongst the baseline shift opcodes, using the existing bit 30 to select arithmetic shift (as the baseline ISA does), bit 29 to select mask shift and the combination of bit 30 and bit 29 to select rotate. GREVI could be encoded as SLLI with bit 30 set, since "left arithmetic shift" makes no sense. This preserves bits 27 and 28 for future extensions and reserves space for GREV near SLL if that is needed in the future. POPCNT, CLZ, and CTZ all lack meaningful immediate forms, so can be encoded in OP/OP-32/OP-64 easily with new funct7 codes, or since these instructions have only a single source operand, one new funct7 code and using the rs2 field to select an opcode.
>
> AND-NOT and OR-NOT have been suggested, and could be easily encoded near the current AND and OR, but would not have immediate forms at all due to lack of encoding space. This would break an elegant symmetry in RISC-V. I omit them for this reason.
>
> Narrower forms of the rotate and mask-shift instructions are assumed to follow the pattern of the existing shift instructions adding 8 instructions each to RV64IB and RV128IB. Encoding GREV near shift, as I advocate, may make narrower GREV easier to implement as well, but GREV can easily be used with care to process narrower data. Maintaining sign extension of narrower values would require narrower forms of all of these to be implemented.
>
> This leaves POPCNT, CLZ, and CTZ. CTZ is inherently correct for narrower data, so not a problem provided that a result greater than the narrower word size is properly handled by software. Narrow CLZ can be handled with native GREVI and native CTZ, with the same constraint on handling the result. Narrow POPCNT requires masking the input, which can be done with an "SLMI / AND / POPCNT" sequence. The big question then is how frequently is narrow POPCNT actually used?

I maintain that W versions are required for all of them not only for consistency but to avoid additional canonicalisation e.g. ZEXT.W. Why skip W on just CTZ when the results tangibly differ depending on the content of the upper bits for all of the instructions. RISC doesn’t mean make extra work due to extreme minimalism. It means choosing the instructions emitted by the compiler. Unfortunately there is a lot of code that uses int32_t and uint32_t. We can’t depend on programmers changing their code to use variable width types. I’d say there would be a lot of 32-bit code that uses __builtin_popcount/__builtin_clz/__builtin_ctz and uint32_t vs __builtin_popcountll/__builtin_clzll/__builtin_ctzll and uint64. Having only 32-bit versions is essentially saying it needs to be more than one instruction. The compilers currently emit canonicalisation instructions when a 64-bit instruction is used on a 32-bit value.

I won’t go as far as saying that we should have ANDW/ORW/XORW/NEGW, but if we would save some engineering on REE (Redundant Extension Elimination). The state of the art in open source compilers are not aware that these instructions don’t affect canonicalisation of the upper bits. The case of CLZ/CTZ/POPCOUNT is much worse as they all have different results when used on ints i.e. require guarding, which makes int on RV64 require 2 or 3 instructions instead of one. Now that’s not the approach we would take for a high performance core. Fuse redundant instructions because we missed support for a particularly common width.

Michael Clark

unread,
Oct 29, 2017, 2:19:20 AM10/29/17
to jcb6...@gmail.com, RISC-V ISA Dev
OK good. I’ll study the opcode map for gaps in these places and come up with some candidates e.g. riscv-opcodes compatible metadata…. soonish… Unless someone has some coding space already planned that we could adopt…

I’m fine with GREVI if there is room for the immediate given that BSWAP and BREV essentially become pseudos. i.e. we can assemble BSWAP and BREV with pseudos as they already have compiler builtins.

I’d like to run some benchmarks and look at retired instructions and cycle estimates…

Sure we could rename BCLZ to CLZ, BCTZ to CTZ, BCNT to POPCOUNT, but I kind of liked the idea of these instructions naturally sorting together due to the use of a common B prefix…

Jacob Bachmeyer

unread,
Oct 29, 2017, 2:37:51 AM10/29/17
to Michael Clark, RISC-V ISA Dev
Michael Clark wrote:
>> On 29/10/2017, at 6:30 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>> Michael Clark wrote:
>>
>>>> On 29/10/2017, at 5:01 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>>>
>>>>
>>>>
>> [...]
>> This leaves POPCNT, CLZ, and CTZ. CTZ is inherently correct for narrower data, so not a problem provided that a result greater than the narrower word size is properly handled by software. Narrow CLZ can be handled with native GREVI and native CTZ, with the same constraint on handling the result. Narrow POPCNT requires masking the input, which can be done with an "SLMI / AND / POPCNT" sequence. The big question then is how frequently is narrow POPCNT actually used?
>>
>
> I maintain that W versions are required for all of them not only for consistency but to avoid additional canonicalisation e.g. ZEXT.W. Why skip W on just CTZ when the results tangibly differ depending on the content of the upper bits for all of the instructions.

Narrow CTZ can be simply "CTZ", narrow CLZ can be "SLLI / CLZ", but
narrow POPCNT still requires explicit masking -- no, wait, "SLLI /
POPCNT" will work in all cases, since SLLI shifts in zeros and POPCNT
counts ones.

CTZ gives identical results to CTZW unless the input value is zero. A
"GREVI / CTZ" or "SLLI / CLZ" sequence is equivalent to CLZW, also
unless the input is zero.

> RISC doesn’t mean make extra work due to extreme minimalism. It means choosing the instructions emitted by the compiler. Unfortunately there is a lot of code that uses int32_t and uint32_t. We can’t depend on programmers changing their code to use variable width types. I’d say there would be a lot of 32-bit code that uses __builtin_popcount/__builtin_clz/__builtin_ctz and uint32_t vs __builtin_popcountll/__builtin_clzll/__builtin_ctzll and uint64. Having only 32-bit versions is essentially saying it needs to be more than one instruction. The compilers currently emit canonicalisation instructions when a 64-bit instruction is used on a 32-bit value.
>

But RISC also favors adapting the compiler to work better with a simpler
processor. I ask whether these issues are best solved by adding to the
ISA or by improving the compilers?

> I won’t go as far as saying that we should have ANDW/ORW/XORW/NEGW, but if we would save some engineering on REE (Redundant Extension Elimination). The state of the art in open source compilers are not aware that these instructions don’t affect canonicalisation of the upper bits.

So then we agree that some forms of this issue are best solved by
improving the compilers?

> The case of CLZ/CTZ/POPCOUNT is much worse as they all have different results when used on ints i.e. require guarding, which makes int on RV64 require 2 or 3 instructions instead of one. Now that’s not the approach we would take for a high performance core. Fuse redundant instructions because we missed support for a particularly common width.
>

Again, the only case where CTZ and CTZW would produce different results
is an input value of zero. The sequences for CLZW are similar. In both
of these cases, the problem is easily solved by using "greater-or-equal"
instead of "equal" when checking the result to determine if the input
was zero. Presumably, CTZW should return 32 for 32-bit zero, so the
only problem is that software must check that CTZ of (uint32_t) is >31
or >=32 instead of ==32. This is a transformation that a compiler
should be able to perform.

The question is how many programs would actually use CTZW/CLZW? Bignum
libraries, for example, would almost certainly rather use full machine
words than 32-bit values on a 64-bit machine. Are there other
applications that would use these instructions and not consider 32-bit
operations on a 64-bit processor a major performance-hampering bug? How
frequent are these cases? Frequency of expected use is important here.


-- Jacob

Jacob Bachmeyer

unread,
Oct 29, 2017, 2:51:09 AM10/29/17
to Michael Clark, RISC-V ISA Dev
Michael Clark wrote:
>> On 29/10/2017, at 6:58 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>> Michael Clark wrote:
>>
>>>> On 29/10/2017, at 5:01 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>>>
>>>> Michael Clark wrote:
>>>>
>>>>
>>>>> Hi Folks,
>>>>>
>>>>> Bit manipulation instructions have been on my mind for a while as I am interesting in creating some encodings so that we can add candidate instructions to a branch of GCC so that we can start looking at stats such as dynamic retired instructions or cycle counts. I’m not yet a member of the BitManip Group in the riscv.org workgroup system so I’m posting here to ISA dev. I haven’t seen any public updates from the BitManip group, and was expecting there would be some open discussion on ISA-dev (a la IETF).
>>>>>
>>>>>
>>>> There were some discussions on ISA-dev before BitManip kicked off, but I share your disappointment. I was hoping that RISC-V development would follow the IETF model.
>>>>
>>>>
>>> There was a comment at the CARRV workshop that we need a RISC-V Weekly News much like the excellent example set by LLVM Weekly News. However the LLVM mailing lists are a bit of a firehose than the RISC-V mailing lists. That’s not to say there is not a lot going on in RISC-V land, it’s just we don’t necessarily find out about it all on the mailing lists.
>>>
>>> The issue is we need status updates shared more frequently with the mailing lists before something like a RISC-V Weekly News would be possible, and of course folk need to be able to share freely without academic pre-publication rules or proprietary industry NDAs hindering open discussion and/or updates.
>>>
>>> In any case, I think i’m best off to just find some opcode space and making a compiler patch and take it from there, rather than try to reach some sort of consensus before proceeding.
>>>
>>> These 9 will have easy lowering for a lot of existing code.
>>>
>>> RLL[W] rd,rs1,rs2 Rotate Left Logical Rotate bits in rs1 left by the amount in rs2
>>> RRL[W] rd,rs1,rs2 Rotate Right Logical Rotate bits in rs1 right by the amount in rs2
>>> RLLI[W] rd,rs1,shamt Rotate Left Logical Immediate Rotate bits in rs1 left by the immediate
>>> RRLI[W] rd,rs1,shamt Rotate Right Logical Immediate Rotate bits in rs1 right by the immediate
>>>
>>>
>> I have been proposing an encoding for these since February: like shift, but with bits 30 and 29 both set.
>>
>>
>>> BCLZ[W] rd,rs1 Bit Count Leading Zeros Count leading zero bits in rs1
>>> BCTZ[W] rd,rs1 Bit Count Trailing Zeros Count trailing zero bits in rs1
>>> BCNT[W] rd,rs1 Bit Count Count number of bits set in rs1
>>>
>>>
>> These can go near SLT, but with bit 30 set and a sub-function code (POPCNT/CTZ/CLZ) in the rs2 field. (Rationale for choosing SLT: SLT/SLTU are the only ALU operations that generate a value inside the ALU rather than deriving it bitwise from the inputs; POPCNT/CTZ/CLZ also generate a value inside the ALU with bits not directly traceable to operand bits.)
>>
>>
>>> BSWAP[W] rd,rs1 Byte Swap Swap byte order in rs1
>>>
>>>
>> I suggest making this a special case of GREVI, and encoding GREVI like SLLI, but with bit 30 set. The shamt field becomes GREVI's "k" parameter. As such, BSWAP on RV64 is {6'b010000, 6'd56, rs1, 3'b001, rd, 7'b0010011}, i.e. "GREVI 56". BSWAPW is either "GREVI 24" or "GREVIW 24" depending on whether sign extension is to be preserved. GREVIW is to SLLIW as GREVI is to SLLI.
>>
>
> OK good. I’ll study the opcode map for gaps in these places and come up with some candidates e.g. riscv-opcodes compatible metadata…. soonish… Unless someone has some coding space already planned that we could adopt…
>

Look a bit closer, I already did that search and gave ways to fit these
into those gaps. :-)

> I’m fine with GREVI if there is room for the immediate given that BSWAP and BREV essentially become pseudos. i.e. we can assemble BSWAP and BREV with pseudos as they already have compiler builtins.
>

That is the reason to pay GREV[I]'s hardware complexity -- it nicely
reduces two otherwise complex operations that we need to
pseudo-instructions. And yes, we should still have BSWAP and BREV
mnemonics, which the assembler expands to GREVI ops.

> I’d like to run some benchmarks and look at retired instructions and cycle estimates…
>
> Sure we could rename BCLZ to CLZ, BCTZ to CTZ, BCNT to POPCOUNT, but I kind of liked the idea of these instructions naturally sorting together due to the use of a common B prefix…
>

Sorting them together is nice, but these operations have
well-established names in computer science: "Count Leading Zeros",
"Count Trailing Zeros", "POPulation Count". I have seen POPCNT, POPC,
POPCOUNT for the latter. POPC is the shortest of these, but looks like
some kind of stack operation at first glance to an unfamiliar reader
("POP C register", perhaps?). POPCNT uses two three-letter
abbreviations, while POPCOUNT would be one of the longest mnemonics in
the RISC-V assembler.


-- Jacob

Michael Clark

unread,
Oct 29, 2017, 4:08:23 AM10/29/17
to jcb6...@gmail.com, RISC-V ISA Dev


> On 29/10/2017, at 7:37 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Narrow CTZ can be simply “CTZ", narrow CLZ can be "SLLI / CLZ", but narrow POPCNT still requires explicit masking -- no, wait, "SLLI / POPCNT" will work in all cases, since SLLI shifts in zeros and POPCNT counts ones.

You can have non canonical upper bits which change the results of POPCOUNT and CTZ (zero in lower 32-bits and non-canonical upper bits), and depending on the context may require canonicalisation, sign or zero extension depending on the context.

> CTZ gives identical results to CTZW unless the input value is zero. A "GREVI / CTZ" or "SLLI / CLZ" sequence is equivalent to CLZW, also unless the input is zero.

With canonicalisation it gives identical results. Sure then we can do SEXT.W or ZEXT.W and fuse the 3 instruction sequence if we go that way, or we can implicitly narrow like the other W instructions (there is precedence for 32-bit ops that work the same given zero upper bits such as logical right shift). It’s better when the canonicalisation is collapsed into the instruction due to native support for 32-bit int widths on 64-bit CPUs.

It adds sufficient complexity. Sure, a short passed to popcount will require canonicalisation, but not int on some architecture. e.g.

- https://cx.rv8.io/g/si1C99

int popcount_s(short v)
{
return __builtin_popcount(v);
}

int popcount_w(int v)
{
return __builtin_popcount(v);
}

int popcount_d(long long v)
{
return __builtin_popcountll(v);
}

One of the problems is that 32-bit int is a sufficiently popular type.

Michael Clark

unread,
Oct 29, 2017, 4:27:34 AM10/29/17
to jcb6...@gmail.com, RISC-V ISA Dev


> On 29/10/2017, at 9:08 PM, Michael Clark <michae...@mac.com> wrote:
>
>
>
>> On 29/10/2017, at 7:37 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>> Narrow CTZ can be simply “CTZ", narrow CLZ can be "SLLI / CLZ", but narrow POPCNT still requires explicit masking -- no, wait, "SLLI / POPCNT" will work in all cases, since SLLI shifts in zeros and POPCNT counts ones.
>
> You can have non canonical upper bits which change the results of POPCOUNT and CTZ (zero in lower 32-bits and non-canonical upper bits), and depending on the context may require canonicalisation, sign or zero extension depending on the context.
>
>> CTZ gives identical results to CTZW unless the input value is zero. A "GREVI / CTZ" or "SLLI / CLZ" sequence is equivalent to CLZW, also unless the input is zero.
>
> With canonicalisation it gives identical results. Sure then we can do SEXT.W or ZEXT.W and fuse the 3 instruction sequence if we go that way, or we can implicitly narrow like the other W instructions (there is precedence for 32-bit ops that work the same given zero upper bits such as logical right shift). It’s better when the canonicalisation is collapsed into the instruction due to native support for 32-bit int widths on 64-bit CPUs.

To be more precise: 64-bit right shift on 32-bit ints that are below 2^31. In fact we don’t really need any of the W instructions because we can emit shifts after them and macro fuse them.

> It adds sufficient complexity. Sure, a short passed to popcount will require canonicalisation, but not int on some architecture. e.g.
>
> - https://cx.rv8.io/g/si1C99
>
> int popcount_s(short v)
> {
> return __builtin_popcount(v);
> }

There is no popcount intrinsic for shorts so it is counting canonicalised bits.

> int popcount_w(int v)
> {
> return __builtin_popcount(v);
> }

There is a popcount, ctz, clz intrinsic for int, and if passed a long it would need canonicalisation without a W variant.

> int popcount_d(long long v)
> {
> return __builtin_popcountll(v);
> }
>
> One of the problems is that 32-bit int is a sufficiently popular type.

The ops that lack 32-but versions are a bit of a headache and need considerable thought regarding correctness. The compiler errs on the side of safety with respect to emitting canonicalisation.

All of the instructions can have distinct results for the 32-bit variants with non-canonical inputs, and the W variants collapse the extra work (which could be sign or zero extension).

It’s a bit of a headache. Especially 64-bit and/or/xor/neg and current compiler tech…

Michael Clark

unread,
Oct 29, 2017, 4:56:58 AM10/29/17
to jcb6...@gmail.com, RISC-V ISA Dev
BTW I actually have a draft sitting around regarding a unique case for 64-bit GCC platforms that lack patterns for __builtin_ctz and __builtin_clz where the default lowering and libgcc versions of the intrinsics causes redundant work on RV64.

On 32-bit platforms, libgcc creates si (32-bit) and di (64-bit) versions of __builtin_popcount, __builtin_ctz and __builtin_clz.

(__popcountsi2, __popcountdi2, __ctzsi2, __ctzdi2, __clzsi2, __clzdi2)

On 64-bit platforms, libgcc creates di (64-bit) and ti (128-bit) versions of __builtin_popcount, __builtin_ctz and __builtin_clz.

(__popcountdi2, __popcountti2, __ctzdi2, __ctzti2, __clzdi2, __clzti2)

This seems to be a bug as other intrinsics have si versions on RV64 and these bitmanip builtins are only available for 32-bit and 64-bit integers. As far as I know there is no __builtin_popcount 128, __builtin_ctz128 and __builtin_clz128.

Due to this, on RV64, count leading zeros on 32-bit integers, spends 4 iterations counting 0 in the first 4 octets in it’s simple table clz implementation.

It doesn’t affect most platforms as they have machine code patterns for these intrinsics, it only affects 64-bit platforms that use libgcc.

This link shows the problem: https://cx.rv8.io/g/5t6qNe

I should probably raise a bug, or even better, submit a patch, as this affects performance on systems without B. It possibly affects gmp big int and soft float codes.
> --
> You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
> To post to this group, send email to isa...@groups.riscv.org.
> Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
> To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/EDABFA2A-C35E-453F-B117-C4935DD3EAA9%40mac.com.

Clifford Wolf

unread,
Oct 29, 2017, 8:07:36 AM10/29/17
to Jacob Bachmeyer, Michael Clark, RISC-V ISA Dev
On Sat, Oct 28, 2017 at 11:01:57PM -0500, Jacob Bachmeyer wrote:
> I still oppose BEXT/BDEP/PEXT/PDEP for two main reasons: the
> enormous hardware complexity and the possible patent risks.

What enormous hardware complexity? BEXT and BDEP are each about as expensive as
GREV. So if BEXT/BDEP have "enormous hardware complexity" then so has GREV.

Re patents: I've implemented 19 BEXT/BDEP cores, some of them completely
avoiding butterfly circuits, or can be modified to doing so. I think there
is enough room for different implementations to avoid possible patent
risks.

Also: I think I know which patent you mean (US 8285766 B2). Hilewitz
himself published his method for BEXT/BDEP 4 years before filing the
patent.

Michael Clark

unread,
Oct 29, 2017, 2:59:53 PM10/29/17
to Clifford Wolf, Jacob Bachmeyer, RISC-V ISA Dev

> On 30/10/2017, at 1:07 AM, Clifford Wolf <clif...@clifford.at> wrote:
>
> On Sat, Oct 28, 2017 at 11:01:57PM -0500, Jacob Bachmeyer wrote:
>> I still oppose BEXT/BDEP/PEXT/PDEP for two main reasons: the
>> enormous hardware complexity and the possible patent risks.
>
> What enormous hardware complexity? BEXT and BDEP are each about as expensive as
> GREV. So if BEXT/BDEP have “enormous hardware complexity" then so has GREV.

I’ve updated the list i’m maintaining to include GREV and GREVI.

- https://github.com/rv8-io/rv8/blob/master/doc/src/bmi.md
- https://github.com/rv8-io/rv8/blob/master/doc/src/bitmanip.png

Note to Jacob: the diagram is for an “area efficient” implementation, not a high performance implementation. This is to ameliorate the concern of folk who are area constrained but want to implement B. A high performance implementation could have dedicated circuits for CLZ/CTZ/POPCOUNT if critical path length is an issue. Having separate opcodes is obviously a necessity for high performance implementations as it doesn’t pre-suppose any particular implementation approach, however it is good to know that an area efficient approach that re-uses functional units is possible.

> Re patents: I've implemented 19 BEXT/BDEP cores, some of them completely
> avoiding butterfly circuits, or can be modified to doing so. I think there
> is enough room for different implementations to avoid possible patent
> risks.
>
> Also: I think I know which patent you mean (US 8285766 B2). Hilewitz
> himself published his method for BEXT/BDEP 4 years before filing the
> patent.

There is prior art for the butterfly circuit in expired patents. patents that are butterfly circuit + foo aren’t necessarily solid patents.

https://encrypted.google.com/patents/US4689762

I remember searching for butterfly circuits and finding references to them dating back to the electro-mechanical telephone switches used in the 1960’s and potentially even earlier, possibly fully mechanical switches.

I think this discussion has been fruitful, at least for me, as i’m a little more aware of how this could all be implemented (i.e. BSWAP/BREV as pseudos instructions with k=24 and k=31).

Thanks,
Michael.

Michael Clark

unread,
Oct 29, 2017, 3:05:15 PM10/29/17
to Clifford Wolf, Jacob Bachmeyer, RISC-V ISA Dev


> On 30/10/2017, at 1:07 AM, Clifford Wolf <clif...@clifford.at> wrote:
>
> On Sat, Oct 28, 2017 at 11:01:57PM -0500, Jacob Bachmeyer wrote:
>> I still oppose BEXT/BDEP/PEXT/PDEP for two main reasons: the
>> enormous hardware complexity and the possible patent risks.
>
> What enormous hardware complexity? BEXT and BDEP are each about as expensive as
> GREV. So if BEXT/BDEP have "enormous hardware complexity" then so has GREV.

I think there is a lot of future potential for performing arbitrary bit permutations using BEXT+BDEP+GREVI.

Synthetic biology is what immediately comes to mind (nucleotide sequences require radix=4) but there are likely many potential future applications…

Allen J. Baum

unread,
Oct 29, 2017, 7:22:30 PM10/29/17
to Michael Clark, jcb6...@gmail.com, RISC-V ISA Dev
At 7:19 PM +1300 10/29/17, Michael Clark wrote:
>
> >
> > These can go near SLT, but with bit 30 set and a sub-function code (POPCNT/CTZ/CLZ) in the rs2 field. (Rationale for choosing SLT: SLT/SLTU are the only ALU operations that generate a value inside the ALU rather than deriving it bitwise from the inputs; POPCNT/CTZ/CLZ also generate a value inside the ALU with bits not directly traceable to operand bits.)

Huh? Even add/subtract don't generate outputs bitwise from inputs; each output bit has to look at all the lower precision inputs. SLT is effectively looking at just a single output bit (e.g. carryout or signs) which is a function of all the lower. Shift ops are not so great in that regard either.

>Sure we could rename BCLZ to CLZ, BCTZ to CTZ, BCNT to POPCOUNT, but I kind of liked the idea of these instructions naturally sorting together due to the use of a common B prefixŠ

I don't know why the common B prefix is important, or why its even a good thing.
Is it a good idea to sort them with branches? Should all ALUOPs start with "A"?It's much better to use Mnemonics that people (and code) are familiar with

GREVI:
>> The question is how often will GREVI be used where k is not 31 or 24.
>
>potentially a lot if you do things like arbitrary bit permutations:
>http://nbviewer.jupyter.org/url/svn.clifford.at/handicraft/2017/permsyn/data.ipynb

I was wondering the same thing - and I don't think this permutation result answers the question. If arbitrary permutations are very high on the list of apps we will run, sure - but overwhelmingly they're not. And, for the special cases: generally you'll have custom HW that will do exactly those permutations (e.g. crypto apps). Instead of shrinking from average 35-ish ops to 20-ish ops , it shrinks it to 1 op in those cases.

While the addition of BEXT and BDEP shrinks the number of ops, it isn't clear how much it shrinks latency, unless they have latency equivalent to an adder (so, single cycle).

My understanding was that the low-cost implementations were multicycle, or at least very high latency. Is there a summary of gate delays vs. number of gates for the cores you've looked at anywhere? When I've looked at it in the (extremely distant) past, it was roughly a gate delay per bit of width (in reality worse, since a 2-2 AOI mux has more delay than a 2in NAND).


>Clifford Wolf <clif...@clifford.at> wrote:
>
>Synthetic biology is what immediately comes to mind (nucleotide sequences require radix=4) but there are likely many potential future applicationsŠ

Again: if this a really important app, then a synthetic biology coprocessor would make more sense.

From a practical point of view, of course you're saving opcode space by specializing - but not saving any significant amount of HW, so really the drawback is can you justify the 62 (or maybe 124 if both RV64 and RV32 are counted) opcodes instead encoding just the 2 that are frequent?

Is there any other single fixed permutation that (combined with rotate) give you permuation functionality? (I'm thinking of a shuffle or unshuffle permutation, which collects all the odd numbered bits and concats with the even numbered bits, or the reverse)


--
**************************************************
* Allen Baum tel. (908)BIT-BAUM *
* 248-2286 *
**************************************************

Michael Clark

unread,
Oct 29, 2017, 7:58:52 PM10/29/17
to Allen Baum, jcb6...@gmail.com, RISC-V ISA Dev


> On 30/10/2017, at 12:22 PM, Allen J. Baum <allen...@esperantotech.com> wrote:
>
> At 7:19 PM +1300 10/29/17, Michael Clark wrote:
>>
>>>
>>> These can go near SLT, but with bit 30 set and a sub-function code (POPCNT/CTZ/CLZ) in the rs2 field. (Rationale for choosing SLT: SLT/SLTU are the only ALU operations that generate a value inside the ALU rather than deriving it bitwise from the inputs; POPCNT/CTZ/CLZ also generate a value inside the ALU with bits not directly traceable to operand bits.)
>
> Huh? Even add/subtract don't generate outputs bitwise from inputs; each output bit has to look at all the lower precision inputs. SLT is effectively looking at just a single output bit (e.g. carryout or signs) which is a function of all the lower. Shift ops are not so great in that regard either.
>
>> Sure we could rename BCLZ to CLZ, BCTZ to CTZ, BCNT to POPCOUNT, but I kind of liked the idea of these instructions naturally sorting together due to the use of a common B prefixŠ
>
> I don't know why the common B prefix is important, or why its even a good thing.
> Is it a good idea to sort them with branches? Should all ALUOPs start with "A"?It's much better to use Mnemonics that people (and code) are familiar with
>
> GREVI:
>>> The question is how often will GREVI be used where k is not 31 or 24.
>>
>> potentially a lot if you do things like arbitrary bit permutations:
>> http://nbviewer.jupyter.org/url/svn.clifford.at/handicraft/2017/permsyn/data.ipynb
>
> I was wondering the same thing - and I don't think this permutation result answers the question. If arbitrary permutations are very high on the list of apps we will run, sure - but overwhelmingly they're not. And, for the special cases: generally you'll have custom HW that will do exactly those permutations (e.g. crypto apps). Instead of shrinking from average 35-ish ops to 20-ish ops , it shrinks it to 1 op in those cases.
>
> While the addition of BEXT and BDEP shrinks the number of ops, it isn't clear how much it shrinks latency, unless they have latency equivalent to an adder (so, single cycle).
>
> My understanding was that the low-cost implementations were multicycle, or at least very high latency. Is there a summary of gate delays vs. number of gates for the cores you've looked at anywhere? When I've looked at it in the (extremely distant) past, it was roughly a gate delay per bit of width (in reality worse, since a 2-2 AOI mux has more delay than a 2in NAND).
>
>
>> Clifford Wolf <clif...@clifford.at> wrote:
>>
>> Synthetic biology is what immediately comes to mind (nucleotide sequences require radix=4) but there are likely many potential future applicationsŠ
>
> Again: if this a really important app, then a synthetic biology coprocessor would make more sense.

I don’t necessarily think application specific coprocessors are realistic in that segment. The reality of today’s supercomputers is that they are dominated by commodity x86 and commodity GPUs made commodity through the economies of scale of gaming. I guess it is a relatively small market. Synthetic biology is just what immediately came to mind given BLAST is referenced in the papers on Parallel Bit Extract and Parallel Bit Deposit. They get to use instructions that have been made available to them. Tensor Processing is marketing buzzword for vectors of lower precision floats (half float in GPUs were first added for mobile device GPUs like android and iPhone devices).

BEXT/BDEP would suit access to an array of bitfield members in a loop, with the masks hoisted out of the loop, or more generally can “accelerate compressing and expanding selections of bits.” The instructions have applications in compression, cryptography as well as computational biology.

They could potentially be in an optional profile. They are sufficiently new that not all possible applications have been explored.

These 9 will have easy lowering for a lot of existing code. I don’t think there should be too much contention regarding this set for anyone who does some analysis of intrinsics that are actually used by current day code and compilers (the observation made by the original RISC authors). We can bike-shed on the naming…

RLL[W] rd,rs1,rs2 Rotate Left Logical Rotate bits in rs1 left by the amount in rs2
RRL[W] rd,rs1,rs2 Rotate Right Logical Rotate bits in rs1 right by the amount in rs2
RLLI[W] rd,rs1,shamt Rotate Left Logical Immediate Rotate bits in rs1 left by the immediate
RRLI[W] rd,rs1,shamt Rotate Right Logical Immediate Rotate bits in rs1 right by the immediate
BCLZ[W] rd,rs1 Bit Count Leading Zeros Count leading zero bits in rs1
BCTZ[W] rd,rs1 Bit Count Trailing Zeros Count trailing zero bits in rs1
BCNT[W] rd,rs1 Bit Count Count number of bits set in rs1
BSWAP[W] rd,rs1 Byte Swap Swap byte order in rs1

These 3 are interesting but I don’t have any benchmarks that can use them immediately without some porting effort. They could be in their own profile.

BEXT[W] rd,rs1,rs2 Bit Extract Gather LSB justified bits to rd from rs1 using extract mask in rs2
BDEP[W] rd,rs1,rs2 Bit Deposit Scatter LSB justified bits from rs2 to rd using deposit mask in rs2
GREVI[W] rd,rs1,imm7 Generalized Reverse Generalized Reverse of bits in rs1 accoring to mask in imm7

I’d be happy with the first set of 9. They have practical uses today and are in use in a lot of code.

Michael Clark

unread,
Oct 29, 2017, 8:04:22 PM10/29/17
to Allen Baum, jcb6...@gmail.com, RISC-V ISA Dev


> On 30/10/2017, at 12:22 PM, Allen J. Baum <allen...@esperantotech.com> wrote:
>
>> Sure we could rename BCLZ to CLZ, BCTZ to CTZ, BCNT to POPCOUNT, but I kind of liked the idea of these instructions naturally sorting together due to the use of a common B prefixŠ
>
> I don't know why the common B prefix is important, or why its even a good thing.
> Is it a good idea to sort them with branches? Should all ALUOPs start with "A"?It's much better to use Mnemonics that people (and code) are familiar with

We put Shifts and Branches together, why not Bit manipulation instructions?:

- RSA - Right Shift Arithmetic
- RSL - Right Shift Logical
- LSL - Left Shift Logical

IMHO It’s actually a nice feature to have things grouped logically. It makes reading easier.

Samuel Falvo II

unread,
Oct 29, 2017, 8:08:57 PM10/29/17
to Allen J. Baum, Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
This was originally intended to be "reply-all". I accidentally
originally sent this to just Allen by mistake. My apologies for the
spam, Allen.

On Sun, Oct 29, 2017 at 4:22 PM, Allen J. Baum
<allen...@esperantotech.com> wrote:
> generally you'll have custom HW that will do exactly those permutations (e.g. crypto apps).

Forgive me if I'm stating the obvious. I'm responding only because I
don't want to see RISC-V's official ISA pigeon-holed.

I would like to see feature parity with the like of POWER-architecture
processors, at least. Any bit-manipulation instruction set extension
should, at the very least, facilitate these jobs:

1) RISC-V emulator instruction decode performance,
2) 2D graphics manipulation (e.g., bit-blits).
3) Constant-time cryptography.

Security is something of a "big thing" these days; without suggesting
we cater specifically to that domain, I also think it's short-sighted
to design an instruction set that makes this more difficult than
necessary. Despite modern-day x86 and ARM processors having
instructions for specific classes of cryptography, they obviously
aren't universal, as blog posts about how certain ways of writing
software results in non-constant-time operations. Also, new crypto
algorithms are being researched all the time, so baking some
instructions for AES today, and Elliptical curve tomorrow, etc. will
result in vastly more ISA pollution than just biting the bullet and
adding support for generalized bit-manipulation instructions today.

Further, with respect to depending on the presence of custom HW, this
assumes you're always using RISC-V for embedded applications. This is
an invalid assumption: *I* intend on using them as general purpose
CPUs for desktop-class computers, for instance. I know, I'm in the
minority. But, still, I'm an existence proof.

>>Synthetic biology is what immediately comes to mind (nucleotide sequences require radix=4) but there are likely many potential future applicationsŠ
>
> Again: if this a really important app, then a synthetic biology coprocessor would make more sense.

One of the earlier RISC-V Workshops I've attended had about *half* of
its attendance from engineers looking at the possibility of using
RISC-V processors for supercomputer construction. They covered
everything from large-scale database engines to scientific computing
clusters the size of buildings. It is foolish to say, "If this is a
really important app." I don't think it's for you to decide. The ISA
should be as orthogonal as possible; do not prematurely optimize. You
can never tell who is going to want to use RISC-V going forward.

Just my $0.02.

--
Samuel A. Falvo II

Jacob Bachmeyer

unread,
Oct 29, 2017, 8:39:42 PM10/29/17
to Michael Clark, RISC-V ISA Dev
Michael Clark wrote:
>> On 29/10/2017, at 7:37 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>> Narrow CTZ can be simply “CTZ", narrow CLZ can be "SLLI / CLZ", but narrow POPCNT still requires explicit masking -- no, wait, "SLLI / POPCNT" will work in all cases, since SLLI shifts in zeros and POPCNT counts ones.
>>
>
> You can have non canonical upper bits which change the results of POPCOUNT and CTZ (zero in lower 32-bits and non-canonical upper bits), and depending on the context may require canonicalisation, sign or zero extension depending on the context.

A "SLLI 32 / POPCNT" sequence solves that problem for POPCNT -- the
shift shifts in 32 zeros in the *low* bits, moving the data to the high
half. A "SLLI 32 / CLZ" sequence behaves similarly, but CLZ returns
XLEN instead of 32 if the input word is zero. CTZ has the same
problem. This is solvable by either ensuring the input is non-zero or
ensuring that results of 32 and XLEN are treated as equivalent.


-- Jacob

Jacob Bachmeyer

unread,
Oct 29, 2017, 8:45:22 PM10/29/17
to Michael Clark, Allen Baum, RISC-V ISA Dev
Michael Clark wrote:
> BEXT/BDEP [...] They are sufficiently new that not all possible
> applications have been explored.

This is why I have been repeatedly expressing concerns about patents on
PEXT/PDEP. "Sufficiently new" is a serious cause for concern here.

-- Jacob

Jacob Bachmeyer

unread,
Oct 29, 2017, 8:58:27 PM10/29/17
to Allen J. Baum, Michael Clark, RISC-V ISA Dev
Allen J. Baum wrote:
> At 7:19 PM +1300 10/29/17, Michael Clark wrote:
>
>>> These can go near SLT, but with bit 30 set and a sub-function code (POPCNT/CTZ/CLZ) in the rs2 field. (Rationale for choosing SLT: SLT/SLTU are the only ALU operations that generate a value inside the ALU rather than deriving it bitwise from the inputs; POPCNT/CTZ/CLZ also generate a value inside the ALU with bits not directly traceable to operand bits.)
>>>
>
> Huh? Even add/subtract don't generate outputs bitwise from inputs; each output bit has to look at all the lower precision inputs. SLT is effectively looking at just a single output bit (e.g. carryout or signs) which is a function of all the lower. Shift ops are not so great in that regard either.
>

Fair enough, but an adder has a structure that clearly aligns input to
output. SLT examines inputs and generates a (1-bit) output value that
depends on the entire input. POPCNT/CTZ/CLZ examine their inputs and
generate a (log2(XLEN)-bit) output value that depends on the entire
input. To me, this seems like the most logical place to add them would
therefore be SLT/SLTU. I am not dead set on this if someone can explain
how those operations make more sense grouped with other hardware. For
example, if POPCNT can be more easily implemented by adding a few gates
to an adder than by modifying the comparator used by SLT or adding a new
functional block, I would support encoding POPCNT near ADD.

> GREVI:
>
>>> The question is how often will GREVI be used where k is not 31 or 24.
>>>
>> potentially a lot if you do things like arbitrary bit permutations:
>> http://nbviewer.jupyter.org/url/svn.clifford.at/handicraft/2017/permsyn/data.ipynb
>>
>
> I was wondering the same thing - and I don't think this permutation result answers the question. If arbitrary permutations are very high on the list of apps we will run, sure - but overwhelmingly they're not. And, for the special cases: generally you'll have custom HW that will do exactly those permutations (e.g. crypto apps). Instead of shrinking from average 35-ish ops to 20-ish ops , it shrinks it to 1 op in those cases.
>

Arbitrary permutations may not be high on the list, but I like the idea
of fewer, more flexible instructions -- GREV gives us both BSWAP and
BREV as special cases and all other possible permutations, while
conveniently fitting into a corner of the opcode space that is otherwise
very hard to use.

> While the addition of BEXT and BDEP shrinks the number of ops, it isn't clear how much it shrinks latency, unless they have latency equivalent to an adder (so, single cycle).
>
> My understanding was that the low-cost implementations were multicycle, or at least very high latency. Is there a summary of gate delays vs. number of gates for the cores you've looked at anywhere? When I've looked at it in the (extremely distant) past, it was roughly a gate delay per bit of width (in reality worse, since a 2-2 AOI mux has more delay than a 2in NAND).
>

I continue to oppose PEXT/PDEP/BEXT/BDEP until someone addresses my
concerns about patents on these operations. I want to be sure that
RISC-V does not gleefully stomp on a giant patent landmine here.

>> Clifford Wolf <clif...@clifford.at> wrote:
>>
>> Synthetic biology is what immediately comes to mind (nucleotide sequences require radix=4) but there are likely many potential future applicationsŠ
>>
>
> Again: if this a really important app, then a synthetic biology coprocessor would make more sense.
>

Agreed that specific applications like that probably will benefit from
specialized coprocessors.

> From a practical point of view, of course you're saving opcode space by specializing - but not saving any significant amount of HW, so really the drawback is can you justify the 62 (or maybe 124 if both RV64 and RV32 are counted) opcodes instead encoding just the 2 that are frequent?
>

The benefit of GREVI here is that it can be encoded as a "left
arithmetic shift", neatly filling an otherwise difficult-to-use corner
of the opcode space. I propose that GREVI be to SLLI as SRAI is to SRLI
-- setting bit 30 in an SLLI instruction produces a GREVI instruction.


-- Jacob

Jacob Bachmeyer

unread,
Oct 29, 2017, 9:05:21 PM10/29/17
to Clifford Wolf, Michael Clark, RISC-V ISA Dev
Clifford Wolf wrote:
> On Sat, Oct 28, 2017 at 11:01:57PM -0500, Jacob Bachmeyer wrote:
>
>> I still oppose BEXT/BDEP/PEXT/PDEP for two main reasons: the
>> enormous hardware complexity and the possible patent risks.
>>
>
> What enormous hardware complexity? BEXT and BDEP are each about as expensive as
> GREV. So if BEXT/BDEP have "enormous hardware complexity" then so has GREV.
>

Can the same hardware block implement both of them?

> Re patents: I've implemented 19 BEXT/BDEP cores, some of them completely
> avoiding butterfly circuits, or can be modified to doing so. I think there
> is enough room for different implementations to avoid possible patent
> risks.
>

The concern I have is the possibility of a patent on the instruction
itself, as I understand there were patents on MIPS unaligned load/store
instructions at one time. Are we sure that Intel has no patent that
could trip us up?

> Also: I think I know which patent you mean (US 8285766 B2). Hilewitz
> himself published his method for BEXT/BDEP 4 years before filing the
> patent.
>

I do not know of any specific patents on the matter and was expressing a
generalized concern. While I (not-a-lawyer) understand that publishing
a method 4 years before filing a patent probably invalidates the patent,
surely the Foundation has lawyers for just this type of problem that can
give a solid answer. Could there be other patents that could bite us on
these instructions? "Submarine patents" are my big concern here.


-- Jacob

Jacob Bachmeyer

unread,
Oct 29, 2017, 11:18:57 PM10/29/17
to Michael Clark, Clifford Wolf, RISC-V ISA Dev
Michael Clark wrote:
> Note to Jacob: the diagram is for an “area efficient” implementation, not a high performance implementation. This is to ameliorate the concern of folk who are area constrained but want to implement B. A high performance implementation could have dedicated circuits for CLZ/CTZ/POPCOUNT if critical path length is an issue. Having separate opcodes is obviously a necessity for high performance implementations as it doesn’t pre-suppose any particular implementation approach, however it is good to know that an area efficient approach that re-uses functional units is possible.
>

:-)

There are other ALU topologies possible as well. Your earlier message
just seemed cavalier about trading area for critical path length.

> [...]
> I think this discussion has been fruitful, at least for me, as i’m a little more aware of how this could all be implemented (i.e. BSWAP/BREV as pseudos instructions with k=24 and k=31).
>

Which leads to another option: the full GREVI could be an optional "RVB
level 2" instruction, with the special cases for BSWAP/BREV being the
opcodes for those operations on implementations that lack the full
GREVI. Allen Baum has mentioned that bit reversal and byte swap can be
easily integrated into a barrel shifter, but that GREV requires quite a
bit more hardware. Or implementations could subset RVB, implementing
BSWAP in hardware and full GREV using a monitor trap.


-- Jacob

Allen J. Baum

unread,
Oct 30, 2017, 12:48:20 AM10/30/17
to Samuel Falvo II, Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
At 5:08 PM -0700 10/29/17, Samuel Falvo II wrote:
This was originally intended to be "reply-all".  I accidentally
originally sent this to just Allen by mistake.  My apologies for the
spam, Allen.

no problem

On Sun, Oct 29, 2017 at 4:22 PM, Allen J. Baum
<allen...@esperantotech.com> wrote:
> generally you'll have custom HW that will do exactly those permutations (e.g. crypto apps).

Forgive me if I'm stating the obvious.  I'm responding only because I
don't want to see RISC-V's official ISA pigeon-holed.

I would like to see feature parity with the like of POWER-architecture
processors, at least.  Any bit-manipulation instruction set extension
should, at the very least, facilitate these jobs:

1) RISC-V emulator instruction decode performance,
2) 2D graphics manipulation (e.g., bit-blits).
3) Constant-time cryptography.

If you really want to do 1&2, then what you want are not BEXT/BDEP, but BFEXT/BFDEP (BitField) as implemented in PA-RISC, IBM801, and others.

Constant time cryptography has a lot of requirements that are pretty orthogonal to BEXT/BDEP; it isn't clear to me that those ops significantly help constant time permutation, in any case - but perhaps. General GF2 ops are probably just as important (cf MicroUnity, whose patents are starting to run out) as are multi-precision primitives (e.g. add with carry (yea, I know there is no carry bit), multiply upper, double width (2in-1out) shift (which you can get for free if you're already implementing rotates ))

Security is something of a "big thing" these days; without suggesting
we cater specifically to that domain, I also think it's short-sighted
to design an instruction set that makes this more difficult than
necessary.  Despite modern-day x86 and ARM processors having
instructions for specific classes of cryptography, they obviously
aren't universal, as blog posts about how certain ways of writing
software results in non-constant-time operations.  Also, new crypto
algorithms are being researched all the time, so baking some
instructions for AES today, and Elliptical curve tomorrow, etc. will
result in vastly more ISA pollution than just biting the bullet and
adding support for generalized bit-manipulation instructions today.

Further, with respect to depending on the presence of custom HW, this
assumes you're always using RISC-V for embedded applications.  This is
an invalid assumption: *I* intend on using them as general purpose
CPUs for desktop-class computers, for instance.  I know, I'm in the
minority.  But, still, I'm an existence proof.

Even desktop class computers benefit from dedicated crypto HW; if its something that's needed a lot, it will lower power and increase performance.


>>Synthetic biology is what immediately comes to mind (nucleotide sequences require radix=4) but there are likely many potential future applications·

>
> Again: if this a really important app, then a synthetic biology coprocessor would make more sense.

One of the earlier RISC-V Workshops I've attended had about *half* of
its attendance from engineers looking at the possibility of using
RISC-V processors for supercomputer construction.  They covered
everything from large-scale database engines to scientific computing
clusters the size of buildings.  It is foolish to say, "If this is a
really important app."  I don't think it's for you to decide.  The ISA
should be as orthogonal as possible; do not prematurely optimize.  You
can never tell who is going to want to use RISC-V going forward.

Supercomputers that are targeted towards specific apps (neural nets, sythetic biology or chemistry, physics simulations, graphics rendering, weather forecasting, database) benefit even more from custom targeted functional units.
The RISC-V ecosystem will enable custom co-processors more easily than ever before.



Just my $0.02.

--
Samuel A. Falvo II


Allen J. Baum

unread,
Oct 30, 2017, 1:01:51 AM10/30/17
to jcb6...@gmail.com, Michael Clark, Clifford Wolf, RISC-V ISA Dev
While we're at it: aren't BGATHER and BSCATTER much more obvious as
to their functionality than BEXT/BDEP (or PEXT/PDEP)?
Extract and Deposit have other meaning in other architectures, and
this just is confusing.

Jacob Bachmeyer

unread,
Oct 30, 2017, 1:59:52 AM10/30/17
to Allen J. Baum, Samuel Falvo II, Michael Clark, RISC-V ISA Dev
Allen J. Baum wrote:
> At 5:08 PM -0700 10/29/17, Samuel Falvo II wrote:
>
>> On Sun, Oct 29, 2017 at 4:22 PM, Allen J. Baum
>> <allen...@esperantotech.com> wrote:
>> > generally you'll have custom HW that will do exactly those
>> permutations (e.g. crypto apps).
>>
>> Forgive me if I'm stating the obvious. I'm responding only because I
>> don't want to see RISC-V's official ISA pigeon-holed.
>>
>> I would like to see feature parity with the like of POWER-architecture
>> processors, at least. Any bit-manipulation instruction set extension
>> should, at the very least, facilitate these jobs:
>>
>> 1) RISC-V emulator instruction decode performance,
>> 2) 2D graphics manipulation (e.g., bit-blits).
>> 3) Constant-time cryptography.
>
> If you really want to do 1&2, then what you want are not BEXT/BDEP,
> but BFEXT/BFDEP (BitField) as implemented in PA-RISC, IBM801, and others.

I had suggested these back in February, but limited available encoding
space poses a serious problem for implementing them on RISC-V. The
approach I offered was to use a control word in a register, but that had
problems of its own.

> Constant time cryptography has a lot of requirements that are pretty
> orthogonal to BEXT/BDEP; it isn't clear to me that those ops
> significantly help constant time permutation, in any case - but
> perhaps. General GF2 ops are probably just as important (cf
> MicroUnity, whose patents are starting to run out) as are
> multi-precision primitives (e.g. add with carry (yea, I know there is
> no carry bit), multiply upper, double width (2in-1out) shift (which
> you can get for free if you're already implementing rotates ))

General GF2 ops sound like a candidate for another extension after those
patents expire. :-)

Add-with-carry is something I have advocated for RVV, since not only
will a vector unit almost certainly be constant-time but is also
well-suited for bignum calculations. I proposed an approach using
additional carry chain vectors in message-id
<57EB10A7...@gmail.com>
<URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/57EB10A7.1000801%40gmail.com>
during a discussion that also included a search for macro-op fusion
candidates for an "add-with-carry".

How does "multiply upper" differ from MULH? I do not understand.

Double-shift would be nice, but simply does not fit in the RISC-V
operation structure (2R1W). Using immediates for the shift parameters
seems to fit, but there is no room to encode them in the base 32-bit space.

>> Security is something of a "big thing" these days; without suggesting
>> we cater specifically to that domain, I also think it's short-sighted
>> to design an instruction set that makes this more difficult than
>> necessary. Despite modern-day x86 and ARM processors having
>> instructions for specific classes of cryptography, they obviously
>> aren't universal, as blog posts about how certain ways of writing
>> software results in non-constant-time operations. Also, new crypto
>> algorithms are being researched all the time, so baking some
>> instructions for AES today, and Elliptical curve tomorrow, etc. will
>> result in vastly more ISA pollution than just biting the bullet and
>> adding support for generalized bit-manipulation instructions today.
>>
>> Further, with respect to depending on the presence of custom HW, this
>> assumes you're always using RISC-V for embedded applications. This is
>> an invalid assumption: *I* intend on using them as general purpose
>> CPUs for desktop-class computers, for instance. I know, I'm in the
>> minority. But, still, I'm an existence proof.
>
> Even desktop class computers benefit from dedicated crypto HW; if its
> something that's needed a lot, it will lower power and increase
> performance.

But performant software crypto is also important: otherwise what
happens when the current algorithms need to be replaced?


-- Jacob

Clifford Wolf

unread,
Oct 30, 2017, 7:04:27 AM10/30/17
to Jacob Bachmeyer, Michael Clark, Allen Baum, RISC-V ISA Dev
Responding to a couple of things in this thread..
On Sun, Oct 29, 2017 at 08:05:17PM -0500, Jacob Bachmeyer wrote:
>> Also: I think I know which patent you mean (US 8285766 B2). Hilewitz
>> himself published his method for BEXT/BDEP 4 years before filing the
>> patent.
>
> I do not know of any specific patents on the matter and was
> expressing a generalized concern. While I (not-a-lawyer) understand
> that publishing a method 4 years before filing a patent probably
> invalidates the patent, surely the Foundation has lawyers for just
> this type of problem that can give a solid answer. Could there be
> other patents that could bite us on these instructions? "Submarine
> patents" are my big concern here.

I think you completely overestimate how far the B extension WG is in their
process. It's not like we are going to release a final B spec tomorrow and
*forgot* to check things like that.

Nobody makes a decision now if BEXT/BDEP will be included or not in a final
spec. If we include them in the first draft spec that just means we are in
a position to be able to evaluate their utility, by adding experimental
compiler support, and running those compilers on real bit manipulation
code, etc. Then we have real data to base a decision on.

If we exclude them now then none of those things will happen and we will
never know if it would have been a good idea to include them.


And further regarding the utility of those instructions:

First, it's not like we are the first to suggest those instructions. If you
have an Intel Haswell or AMD Excavator processor then you have BMI2, which
means your x86 processor supports PDEP and PEXT today.

Second, there are operations that are known to be common in bit
manipulation code, such as shuffle/unshuffle and sheep-and-goats, that can
easily be implemented with BEXT/BDEP (+GREV for sheep-and-goats). So it's
not like we are inventing entirely new problems here to justify BEXT/BDEP.
There are a bunch of existing problems that BEXT/BDEP can solve.

Finally, if we don't include BEXT/BDEP then there is the question if we
should include weaker versions of it, like a bit field extract instruction
(such as BEXTR on x86). Those would also require hardware resources. This
must be taken into consideration when arguing about the cost vs. benefits
of BEXT/BDEP.


And finally regarding that B-suffix discussions:

Just to be absolutely clear: I never meant to imply that the B extension
instructions would all start with the letter B and tbh I find the idea
ridiculous. I just substituted the 'P' with a 'B' in PEXT/PDEP because the
P stands for parallel, which means it describes the concrete method used to
implement the instruction, which is just really bad instruction naming imo.
Some implementations would decide to use a non-parallel implementation of
the instruction and that would be perfectly fine.

There are also different methods to implement a divider for example, but we
don't imply any specific implementation with the DIV/DIVU/REM/REMU
instructions either, so why should we do that with BEXT/BDEP.

Rogier Brussee

unread,
Oct 30, 2017, 10:16:45 AM10/30/17
to RISC-V ISA Dev


Op zaterdag 28 oktober 2017 20:19:08 UTC+2 schreef michaeljclark:
Hi Folks,

Bit manipulation instructions have been on my mind for a while as I am interesting in creating some encodings so that we can add candidate instructions to a branch of GCC so that we can start looking at stats such as dynamic retired instructions or cycle counts. I’m not yet a member of the BitManip Group in the riscv.org workgroup system so I’m posting here to ISA dev. I haven’t seen any public updates from the BitManip group, and was expecting there would be some open discussion on ISA-dev (a la IETF).

Good inititiative! 

[snip]
 
## Rotate

RLL[W] rd,rs1,rs2                Rotate Left Logical                                Rotate bits in rs1 left by the amount in rs2
RRL[W] rd,rs1,rs2                Rotate Right Logical                        Rotate bits in rs1 right by the amount in rs2
RLLI[W] rd,rs1,shamt        Rotate Left Logical Immediate        Rotate bits in rs1 left by the immediate
RRLI[W] rd,rs1,shamt        Rotate Right Logical Immediate        Rotate bits in rs1 right by the immediate

Rotate instructions can be implemented very easily with the addition of carry-out carry-in logic to an existing shifter. Rotates are very simple instructions but they are frequently used in cryptographic ciphers so the small saving in cycles (3:1) is likely worth the additional area for processors that implement the B extension.




I don't see why one would need both a left and right rotate. E.g. RRLI rd rs1 shamt  gives exactly the same result as RLLI rd rs1 (XLEN - shamt) 

 
## Byte swap

BSWAP[W] rd,rs1                Byte Swap                                        Swap byte order in rs1

Byte swapping instructions are essential for networking code and cryptographic ciphers which typically use big endian formats. A 32-bit byte swap takes ~14 instructions and a 64-bit byte swap takes ~30 instructions.



Even BSWAPH might be worse it as it is used so extensively in networking code.  I guess that could be a GREVI mnemonic too. 

 
## Count Leading and Trailing Zeros

BCLZ[W] rd,rs1                Bit Count Leading Zeros                        Count leading zero bits in rs1
BCTZ[W] rd,rs1                Bit Count Trailing Zeros                        Count trailing zero bits in rs1
BCNT[W] rd,rs1                Bit Count                                                Count number of bits set in rs1
BREV[W] rd,rs1                Bit Reverse                                        Reverse bits in rs1

Count leading and trailing zeros can be constructed using popcount (BCNT) and bit reverse (BREV) instructions or functional units. Given the length of the instruction sequences and the possibility to multiplex via shared bit reverse and popcount functional units, the count leading and trailing zeros instructions are included.

 
clz is really important in softfloat, so may actually be used relatively often in small implementations.


## Parallel Bit Extract and Deposit

BEXT[W] rd,rs1,rs2                Bit Extract                                        Gather LSB justified bits to rd from rs1 using extract mask in rs2
BDEP[W] rd,rs1,rs2                Bit Deposit                                        Scatter LSB justified bits from rs2 to rd using deposit mask in rs2

Bit Extract

Gather LSB justified bits to rd from rs1 using extract mask in rs2. Bits are extracted from the source register using bit positions enabled in the mask register and are placed in the result as popcount(mask) LSB contiguous bits.

Register        Value
source        0b11110100
mask        0b01100011
result        0b00001100

Bit Deposit

Scatter LSB justified bits from rs2 to rd using deposit mask in rs2. Bits are deposited from the source register using popcount(mask) LSB contiguous bits and are placed in the result at the bit positions enabled in the mask register.

Register        Value
source        0b11110100
mask        0b01100011
result        0b00100000

Rogier Brussee

unread,
Oct 30, 2017, 10:26:57 AM10/30/17
to RISC-V ISA Dev, michae...@mac.com, jcb6...@gmail.com

AND-NOT and OR-NOT have been suggested, and could be easily encoded near
the current AND and OR, but would not have immediate forms at all due to
lack of encoding space. This would break an elegant symmetry in RISC-V.
I omit them for this reason.


There is no SUBI either, and for the same reason: just like you can do 
an ANDI with a negated immediate, you can do an AND with a complemented immediate
if you need a ANDNOTI 

-- Jacob

Rogier Brussee

unread,
Oct 30, 2017, 10:32:31 AM10/30/17
to RISC-V ISA Dev, jcb6...@gmail.com
[snip]
 
I maintain that W versions are required for all of them not only for consistency but to avoid additional canonicalisation e.g. ZEXT.W. Why skip W on just CTZ when the results tangibly differ depending on the content of the upper bits for all of the instructions. RISC doesn’t mean make extra work due to extreme minimalism. It means choosing the instructions emitted by the compiler. Unfortunately there is a lot of code that uses int32_t and uint32_t. We can’t depend on programmers changing their code to use variable width types. I’d say there would be a lot of 32-bit code that uses __builtin_popcount/__builtin_clz/__builtin_ctz and uint32_t vs __builtin_popcountll/__builtin_clzll/__builtin_ctzll and uint64. Having only 32-bit versions is essentially saying it needs to be more than one instruction. The compilers currently emit canonicalisation instructions when a 64-bit instruction is used on a 32-bit value.

I won’t go as far as saying that we should have ANDW/ORW/XORW/NEGW, but if we would save some engineering on REE (Redundant Extension Elimination). The state of the art in open source compilers are not aware that these instructions don’t affect canonicalisation of the upper bits.

Then why not simply define pseudo-instructions ANDW ORW XORW and NOTW that expand to the corresponding full width instruction?
 

Rogier Brussee

unread,
Oct 30, 2017, 10:38:20 AM10/30/17
to RISC-V ISA Dev, michae...@mac.com, jcb6...@gmail.com

The question is how many programs would actually use CTZW/CLZW?  

CLZW is used extensively in softfloat (the W version for floats as opposed to double's)
 

Rogier Brussee

unread,
Oct 30, 2017, 10:55:29 AM10/30/17
to RISC-V ISA Dev, allen...@esperantotech.com, michae...@mac.com, jcb6...@gmail.com

input. To me, this seems like the most logical place to add them would
therefore be SLT/SLTU. 
I am not dead set on this if someone can explain
how those operations make more sense grouped with other hardware.

Just a few extra gates added to SLT and SLTU give you MAX, MAXU MIN and MINU :-)

 
For
example, if POPCNT can be more easily implemented by adding a few gates
to an adder than by modifying the comparator used by SLT or adding a new
functional block, I would support encoding POPCNT near ADD.

> GREVI:
>  
>>> The question is how often will GREVI be used where k is not 31 or 24.
>>>      
>> potentially a lot if you do things like arbitrary bit permutations:
>> http://nbviewer.jupyter.org/url/svn.clifford.at/handicraft/2017/permsyn/data.ipynb
>>    
 
Arbitrary permutations may not be high on the list, but I like the idea
of fewer, more flexible instructions -- GREV gives us both BSWAP and
BREV as special cases and all other possible permutations, while
conveniently fitting into a corner of the opcode space that is otherwise
very hard to use.

Which bit permutations does GREVI actually give. Clearly not all 64 factorial ones.
 

-- Jacob

Bruce Hoult

unread,
Oct 30, 2017, 11:41:15 AM10/30/17
to Rogier Brussee, RISC-V ISA Dev, Allen J. Baum, Michael Clark, Jacob Bachmeyer
On Mon, Oct 30, 2017 at 5:55 PM, Rogier Brussee <rogier....@gmail.com> wrote:
Arbitrary permutations may not be high on the list, but I like the idea
of fewer, more flexible instructions -- GREV gives us both BSWAP and
BREV as special cases and all other possible permutations, while
conveniently fitting into a corner of the opcode space that is otherwise
very hard to use.

Which bit permutations does GREVI actually give. Clearly not all 64 factorial ones.

It doesn't give all 64 factorial permutations in a single instruction, clearly.

Seems to me it's more like a "generalized rotate" instruction. A definition is at https://groups.google.com/a/groups.riscv.org/forum/#!msg/isa-dev/PeTwBaKuC9w/E6kKdcs_AwAJ
 

Samuel Falvo II

unread,
Oct 30, 2017, 11:46:16 AM10/30/17
to Rogier Brussee, RISC-V ISA Dev, Michael Clark, Jacob Bachmeyer
On Mon, Oct 30, 2017 at 7:26 AM, Rogier Brussee
<rogier....@gmail.com> wrote:
>>
>> AND-NOT and OR-NOT have been suggested, and could be easily encoded near
> There is no SUBI either, and for the same reason: just like you can do
> an ANDI with a negated immediate, you can do an AND with a complemented
> immediate
> if you need a ANDNOTI

AND-NOT (NAND) is not the same as inverting one input to an AND operation.

x NAND y == (NOT x) OR (NOT y)

x AND (NOT y) == (NOT x) NOR y

To get ANDNOT, you would need to perform the AND operation, followed
by a XORI r, r, -1.

Bruce Hoult

unread,
Oct 30, 2017, 12:09:41 PM10/30/17
to Samuel Falvo II, Rogier Brussee, RISC-V ISA Dev, Michael Clark, Jacob Bachmeyer
On Mon, Oct 30, 2017 at 6:46 PM, Samuel Falvo II <sam....@gmail.com> wrote:
On Mon, Oct 30, 2017 at 7:26 AM, Rogier Brussee
<rogier....@gmail.com> wrote:
>>
>> AND-NOT and OR-NOT have been suggested, and could be easily encoded near
> There is no SUBI either, and for the same reason: just like you can do
> an ANDI with a negated immediate, you can do an AND with a complemented
> immediate
> if you need a ANDNOTI

AND-NOT (NAND) is not the same as inverting one input to an AND operation.

No.

AND-NOT, also known as BIC (BIt Clear) on CPUs such as PDP11 and VAX or ANDC (AND with Complement) on various IBM machines is not the same operation as NAND. It is literally inverting the 2nd operand to AND.

It is very useful for operations such as inserting a new value into a bitfield: a mask is first generated with 1s in the positions to be inserted, the mask is then applied to the new value using AND, to the destination using ANDC/BIC, and the two results ORd together.

Note that said DEC machines had BIC and BIS (BIt Set, same as OR) mnemonics and did not have plain AND at all -- they did have BIT (BIt Test), which did an AND but did not store the result, affecting only condition codes.

Allen J. Baum

unread,
Oct 30, 2017, 12:14:58 PM10/30/17
to Bruce Hoult, Rogier Brussee, RISC-V ISA Dev, Michael Clark, Jacob Bachmeyer
Specifically, GREV swaps pairs of bits (only).
Which pairs it swaps depends on the immediate field.
It isn't a "rotate" except for the special case of the immediate field = 0x20

The operation is: Output_Bit(N) <--> Input_Bit(N^imm)
Imm==0x3F gets you a 64 bit end-to-end swap (bit reverse)
Imm==0x38 will leave bits within a byte alone, but swaps between byte (byte swap)

At 6:41 PM +0300 10/30/17, Bruce Hoult wrote:
.....

Which bit permutations does GREVI actually give. Clearly not all 64 factorial ones.

It doesn't give all 64 factorial permutations in a single instruction, clearly.

Seems to me it's more like a "generalized rotate" instruction. A definition is at https://groups.google.com/a/groups.riscv.org/forum/#!msg/isa-dev/PeTwBaKuC9w/E6kKdcs_AwAJ
 


Allen J. Baum

unread,
Oct 30, 2017, 12:20:51 PM10/30/17
to Clifford Wolf, Jacob Bachmeyer, Michael Clark, RISC-V ISA Dev
quibble: field extract/deposit take about 20 gates/bit(over and above the barrel shifter) - the tricky part is specifying the operands .

At 12:04 PM +0100 10/30/17, Clifford Wolf wrote:
>Responding to a couple of things in this thread..
>
>
>On Sun, Oct 29, 2017 at 07:45:19PM -0500, Jacob Bachmeyer wrote:
> > Michael Clark wrote:
>.....
>Finally, if we don't include BEXT/BDEP then there is the question if we
>should include weaker versions of it, like a bit field extract instruction
>(such as BEXTR on x86). Those would also require hardware resources. This
>must be taken into consideration when arguing about the cost vs. benefits
>of BEXT/BDEP.
>
>
>And finally regarding that B-suffix discussions:
>
>Just to be absolutely clear: I never meant to imply that the B extension
>instructions would all start with the letter B and tbh I find the idea
>ridiculous.


I still like BGATHER, BSCATTER which actually makes sense

Samuel Falvo II

unread,
Oct 30, 2017, 12:25:07 PM10/30/17
to Bruce Hoult, Rogier Brussee, RISC-V ISA Dev, Michael Clark, Jacob Bachmeyer
On Mon, Oct 30, 2017 at 9:09 AM, Bruce Hoult <br...@hoult.org> wrote:
>> AND-NOT (NAND) is not the same as inverting one input to an AND operation.
>
> No.
>
> AND-NOT, also known as BIC (BIt Clear) on CPUs such as PDP11 and VAX or ANDC

Ahhh, in that case, I agree with what was posted. I had Windows ROps
in mind when I read AND-NOT.

Michael Clark

unread,
Oct 30, 2017, 7:09:17 PM10/30/17
to Allen Baum, Clifford Wolf, Jacob Bachmeyer, RISC-V ISA Dev


> On 31/10/2017, at 5:20 AM, Allen J. Baum <allen...@esperantotech.com> wrote:
>
> quibble: field extract/deposit take about 20 gates/bit(over and above the barrel shifter) - the tricky part is specifying the operands .
>
> At 12:04 PM +0100 10/30/17, Clifford Wolf wrote:
>> Responding to a couple of things in this thread..
>>
>>
>> On Sun, Oct 29, 2017 at 07:45:19PM -0500, Jacob Bachmeyer wrote:
>>> Michael Clark wrote:
>> .....
>> Finally, if we don't include BEXT/BDEP then there is the question if we
>> should include weaker versions of it, like a bit field extract instruction
>> (such as BEXTR on x86). Those would also require hardware resources. This
>> must be taken into consideration when arguing about the cost vs. benefits
>> of BEXT/BDEP.
>>
>>
>> And finally regarding that B-suffix discussions:
>>
>> Just to be absolutely clear: I never meant to imply that the B extension
>> instructions would all start with the letter B and tbh I find the idea
>> ridiculous.
>
>
> I still like BGATHER, BSCATTER which actually makes sense

The names in the literature are Parallel Extract and Parallel Deposit however I agree Bit Gather and Bit Scatter are more intuitive names, but unfortunately the opcodes are very long and will much up tabs.

I originally had PBE/PBD or PBG/PBS ie. Parallel Bit Extract / Parallel Bit Deposi or Parallel Bit Gather / Parallel Bit Scatter for them.

Below is the candidate list using standard names for CLZ/CTZ/POPCOUNT. I liked the former BCLZ/BCTZ/BCNT because of the more regular naming scheme and tab friendliness for objdump output however that is a minor issue that I’m not too fussed about.

I think it’s worthwhile creating a branch of GCC with these instructions so we can perform some evaluation (the first 9 all have existing intrinsics or lifts):

RLL[W] rd,rs1,rs2 Rotate Left Logical Rotate bits in rs1 left by the amount in rs2
RRL[W] rd,rs1,rs2 Rotate Right Logical Rotate bits in rs1 right by the amount in rs2
RLLI[W] rd,rs1,shamt Rotate Left Logical Immediate Rotate bits in rs1 left by the immediate
RRLI[W] rd,rs1,shamt Rotate Right Logical Immediate Rotate bits in rs1 right by the immediate
CLZ[W] rd,rs1 Count Leading Zeros Count leading zero bits in rs1
CTZ[W] rd,rs1 Count Trailing Zeros Count trailing zero bits in rs1
POPCOUNT[W] rd,rs1 Population Count Count number of bits set in rs1
BSWAP[W] rd,rs1 Byte Swap Swap byte order in rs1
BREV[W] rd,rs1 Bit Reverse Swap byte order in rs1
BEXT[W] rd,rs1,rs2 Bit Extract Gather LSB justified bits to rd from rs1 using extract mask in rs2
BDEP[W] rd,rs1,rs2 Bit Deposit Scatter LSB justified bits from rs2 to rd using deposit mask in rs2

Michael Clark

unread,
Oct 30, 2017, 7:46:48 PM10/30/17
to Allen Baum, Clifford Wolf, Jacob Bachmeyer, RISC-V ISA Dev
and we can make RV32 BSWAP/BREV and RV64 BSWAP[W]/BREV[W] pseudos using the appropriate k values for GREVI

We can perhaps evaluate with and without W; as there are arguments both ways. The compiler distinguishes between them as it has 32-bit and 64-bit variants for the associated builtins.

Jacob Bachmeyer

unread,
Oct 30, 2017, 9:45:03 PM10/30/17
to Rogier Brussee, RISC-V ISA Dev, michae...@mac.com
Good; there is a use for it. However, CLZW is equivalent to "SLLI /
CLZ" for non-zero inputs and to "SLMI / CLZ" for all inputs. (SLMI is
the left mask-shift I propose that inserts ones.) Would softfloat code
care about CLZW on zero producing XLEN instead of 32? Would softfloat
code even apply CLZ to a zero value? Is a two-instruction sequence
sufficient or is CLZW used frequently enough to warrant its own hardware
opcode?


-- Jacob

Jacob Bachmeyer

unread,
Oct 30, 2017, 9:52:43 PM10/30/17
to Rogier Brussee, RISC-V ISA Dev
Rogier Brussee wrote:
> [...]
>
>
> ## Rotate
>
> RLL[W] rd,rs1,rs2 Rotate Left
> Logical Rotate bits in rs1 left by
> the amount in rs2
> RRL[W] rd,rs1,rs2 Rotate Right
> Logical Rotate bits in rs1 right by the
> amount in rs2
> RLLI[W] rd,rs1,shamt Rotate Left Logical
> Immediate Rotate bits in rs1 left by the immediate
> RRLI[W] rd,rs1,shamt Rotate Right Logical
> Immediate Rotate bits in rs1 right by the immediate
>
> Rotate instructions can be implemented very easily with the
> addition of carry-out carry-in logic to an existing shifter.
> Rotates are very simple instructions but they are frequently used
> in cryptographic ciphers so the small saving in cycles (3:1) is
> likely worth the additional area for processors that implement the
> B extension.
>
>
> I don't see why one would need both a left and right rotate. E.g. RRLI
> rd rs1 shamt gives exactly the same result as RLLI rd rs1 (XLEN - shamt)

For symmetry with the shift instructions, where both directions are
needed, although I would drop the trailing "L" from the mnemonics --
there is no "arithmetic rotate".

>
> ## Byte swap
>
> BSWAP[W] rd,rs1 Byte
> Swap Swap byte order in rs1
>
> Byte swapping instructions are essential for networking code and
> cryptographic ciphers which typically use big endian formats. A
> 32-bit byte swap takes ~14 instructions and a 64-bit byte swap
> takes ~30 instructions.
>
>
>
> Even BSWAPH might be worse it as it is used so extensively in
> networking code. I guess that could be a GREVI mnemonic too.

This is why I favor GREVI over individual swap/reverse instructions.


-- Jacob

Andrew Waterman

unread,
Oct 30, 2017, 10:13:38 PM10/30/17
to Jacob Bachmeyer, Rogier Brussee, RISC-V ISA Dev, Michael Clark
FWIW, GCC is already programmed to cope with clz(0) having different
values for different operand sizes, or even with clz(0) being
undefined.
> --
> You received this message because you are subscribed to the Google Groups
> "RISC-V ISA Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to isa-dev+u...@groups.riscv.org.
> To post to this group, send email to isa...@groups.riscv.org.
> Visit this group at
> https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
> To view this discussion on the web visit
> https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/59F7D59A.20308%40gmail.com.

Michael Clark

unread,
Oct 30, 2017, 10:17:46 PM10/30/17
to jcb6...@gmail.com, Rogier Brussee, RISC-V ISA Dev


> On 31/10/2017, at 2:52 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> For symmetry with the shift instructions, where both directions are needed, although I would drop the trailing "L" from the mnemonics -- there is no "arithmetic rotate".

There is no arithmetic shift left nevertheless RISC-V uses “Shift Left Logical”

There are some rotate instructions on some architectures that read the carry and thus are not logical rotates. There is no harm in being specific/consistent.

Jacob Bachmeyer

unread,
Oct 30, 2017, 11:01:31 PM10/30/17
to Clifford Wolf, Michael Clark, Allen Baum, RISC-V ISA Dev
Clifford Wolf wrote:
> Responding to a couple of things in this thread..
>
>
> On Sun, Oct 29, 2017 at 07:45:19PM -0500, Jacob Bachmeyer wrote:
>
>> Michael Clark wrote:
>>
>>> BEXT/BDEP [...] They are sufficiently new that not all possible
>>> applications have been explored.
>>>
>> This is why I have been repeatedly expressing concerns about patents
>> on PEXT/PDEP. "Sufficiently new" is a serious cause for concern
>> here.
>>
>
> On Sun, Oct 29, 2017 at 08:05:17PM -0500, Jacob Bachmeyer wrote:
>
>>> Also: I think I know which patent you mean (US 8285766 B2). Hilewitz
>>> himself published his method for BEXT/BDEP 4 years before filing the
>>> patent.
>>>
>> I do not know of any specific patents on the matter and was
>> expressing a generalized concern. While I (not-a-lawyer) understand
>> that publishing a method 4 years before filing a patent probably
>> invalidates the patent, surely the Foundation has lawyers for just
>> this type of problem that can give a solid answer. Could there be
>> other patents that could bite us on these instructions? "Submarine
>> patents" are my big concern here.
>>
>
> I think you completely overestimate how far the B extension WG is in their
> process. It's not like we are going to release a final B spec tomorrow and
> *forgot* to check things like that.
>

Are you saying that the reason we have heard nothing from BitManip WG is
that there is not even a first draft yet?

> Nobody makes a decision now if BEXT/BDEP will be included or not in a final
> spec. If we include them in the first draft spec that just means we are in
> a position to be able to evaluate their utility, by adding experimental
> compiler support, and running those compilers on real bit manipulation
> code, etc. Then we have real data to base a decision on.
>
> If we exclude them now then none of those things will happen and we will
> never know if it would have been a good idea to include them.
>

Fair enough; include them in the first draft, then.

> And further regarding the utility of those instructions:
>
> First, it's not like we are the first to suggest those instructions. If you
> have an Intel Haswell or AMD Excavator processor then you have BMI2, which
> means your x86 processor supports PDEP and PEXT today.
>

If we were the first, I would be less concerned about patents. The fact
that references I find suggest that Intel may have been the first to
actually implement these instructions concerns me. Weak patents (and
sometimes seemingly no actual patents at all...) still work quite well
for FUD.

> Second, there are operations that are known to be common in bit
> manipulation code, such as shuffle/unshuffle and sheep-and-goats, that can
> easily be implemented with BEXT/BDEP (+GREV for sheep-and-goats). So it's
> not like we are inventing entirely new problems here to justify BEXT/BDEP.
> There are a bunch of existing problems that BEXT/BDEP can solve.
>

I am not questioning the usefulness of BEXT/BDEP, especially since it
seems that GREV requires similar (or even the same?) hardware.

> Finally, if we don't include BEXT/BDEP then there is the question if we
> should include weaker versions of it, like a bit field extract instruction
> (such as BEXTR on x86). Those would also require hardware resources. This
> must be taken into consideration when arguing about the cost vs. benefits
> of BEXT/BDEP.
>

Field extract/insert go all the way back to the PDP-10 (as LDB/DPB,
which are still names for those functions in Common Lisp) so have no
patent issues, but are infeasible to cleanly encode in RISC-V. I
earlier suggested an encoding that used a packed control word in a
register, but that would be a bit of a hack and BEXT/BDEP would be
preferable. The only reason not to include BEXT/BDEP (assuming GREV can
use the same hardware) I see at this point is if those are somehow patented.


-- Jacob

Clifford Wolf

unread,
Oct 31, 2017, 6:01:19 AM10/31/17
to Jacob Bachmeyer, Michael Clark, Allen Baum, RISC-V ISA Dev
On Mon, Oct 30, 2017 at 10:01:24PM -0500, Jacob Bachmeyer wrote:
>> I think you completely overestimate how far the B extension WG is in their
>> process. It's not like we are going to release a final B spec tomorrow and
>> *forgot* to check things like that.
>
> Are you saying that the reason we have heard nothing from BitManip
> WG is that there is not even a first draft yet?

I said there is not going to be a final spec tomorrow.

It is a complete mystery to me how you can read that as "there is not even
a first draft". Not replying to the rest bc obviously it doesn't matter
what I write..

Clifford Wolf

unread,
Oct 31, 2017, 8:23:36 AM10/31/17
to Allen J. Baum, Michael Clark, jcb6...@gmail.com, RISC-V ISA Dev
Hi,

On Sun, Oct 29, 2017 at 04:22:24PM -0700, Allen J. Baum wrote:
> GREVI:
> >> The question is how often will GREVI be used where k is not 31 or 24.
> >
> >potentially a lot if you do things like arbitrary bit permutations:
> >http://nbviewer.jupyter.org/url/svn.clifford.at/handicraft/2017/permsyn/data.ipynb
>
> I was wondering the same thing - and I don't think this permutation
> result answers the question. If arbitrary permutations are very high on
> the list of apps we will run, sure - but overwhelmingly they're not. And,
> for the special cases: generally you'll have custom HW that will do
> exactly those permutations (e.g. crypto apps). Instead of shrinking from
> average 35-ish ops to 20-ish ops , it shrinks it to 1 op in those cases.

(1) I never said that the bit permutation study answers the question. It's
just one more data point.

(2) This is the B extension. By definition, the features added by it are
aimed at bit manipulation code and will not be benefiting other code a lot.
So discussing what general non-bitmanip code needs is a strawman.

(3) You need log2(XLEN) bits of parameters for a general permutation
operation. You are not going to get 1 op per permutation unless you come up
with a strategy to transfer log2(XLEN) parameter words into your custom
permutation HW in one op (in addition to transfering the operand). So
including the operand this is 6 words for 32 bits and 7 words for 64 bits.

(4) The difference between ROT+GREV and ROT+GREV+BEXT+BDEP in the study is
consistenty >2x for 5 bits or more and consistently >3x for ROT vs
ROT+GREV+BEXT+BDEP for more than 7 bits (see last image in the jupyter
notebook). Your "35-ish ops to 20-ish ops" would correspond to a factor of
1.75x. So your are grossly misrepresenting my results.

(5) The study only looks at the actual data operations needed per data
word, not the additional operations needed to load parameter constants. A
difference of >2x can mean much higher register pressure and may prevent
you from doing more than one permutation operation in a loop iteration, so
without BEXT/BDEP you might need multiple passes over the data or need
additional instructions to re-load the same constants in each loop
iteration.

And again: I'm not saying I know that ultimately BEXT+BDEP should be
included. All I'm saying is that if you say it shouldn't then I seriously
doubt that you have the numbers to support that statement, and the path to
getting those numbers is by including BEXT+BDEP in the B extension draft.

> While the addition of BEXT and BDEP shrinks the number of ops, it isn't
> clear how much it shrinks latency, unless they have latency equivalent
> to an adder (so, single cycle).

This is not an ISA question. Like with multiply/divide, or even shift,
different implementations may implement instructions differently. The
compilers will need performance profiles for each processor that help it
pick the right instruction sequence for a given program and target processor.

The ISA only defines which instructions must exist, not how efficient they
must be implemented.

> My understanding was that the low-cost implementations were multicycle,
> or at least very high latency. Is there a summary of gate delays vs.
> number of gates for the cores you've looked at anywhere? When I've looked
> at it in the (extremely distant) past, it was roughly a gate delay per
> bit of width (in reality worse, since a 2-2 AOI mux has more delay than a
> 2in NAND).

I don't know what implementation you looked at, but the delay of a
single-stage BEXT/BDEP implementation should be roughly proportional to the
logarithm of XLEN, not proportional to XLEN.

I've now added the gate delays to the README for my 19 reference cores.

For example my single-stage BEXT+BDEP+GREV core has a delay of 49 gate
times in the 64 bit variant, and 38 gate times in the 32 bit variant.

This evaluations map the cores to a library of NOT, NAND, and NOR. No
AOI or OAI gates are used in my evaluation.

>> Synthetic biology is what immediately comes to mind (nucleotide sequences require radix=4) but there are likely many potential future applicationsŠ
>
> Again: if this a really important app, then a synthetic biology coprocessor would make more sense.

Not how most of the high performance computing is done today.. In most
cases one will have to make do with what commodity processors provide.

regards,
- clifford

Rogier Brussee

unread,
Oct 31, 2017, 8:54:16 AM10/31/17
to RISC-V ISA Dev, rogier....@gmail.com, michae...@mac.com, jcb6...@gmail.com


Op dinsdag 31 oktober 2017 02:45:03 UTC+1 schreef Jacob Bachmeyer:
Don't know really. From the UCBAR softfloat lib 



#ifndef softfloat_countLeadingZeros32
/*----------------------------------------------------------------------------
| Returns the number of leading 0 bits before the most-significant 1 bit of
| 'a'. If 'a' is zero, 32 is returned.
*----------------------------------------------------------------------------*/
#if defined INLINE_LEVEL && (3 <= INLINE_LEVEL)
INLINE uint_fast8_t softfloat_countLeadingZeros32( uint32_t a )
{
uint_fast8_t count = 0;
if ( a < 0x10000 ) {
count = 16;
a <<= 16;
}
if ( a < 0x1000000 ) {
count += 8;
a <<= 8;
}
count += softfloat_countLeadingZeros8[a>>24];
return count;
}
#else
uint_fast8_t softfloat_countLeadingZeros32( uint32_t a );
#endif
#endif


 

-- Jacob

Rogier Brussee

unread,
Oct 31, 2017, 9:00:55 AM10/31/17
to RISC-V ISA Dev, allen...@esperantotech.com, clif...@clifford.at, jcb6...@gmail.com


Op dinsdag 31 oktober 2017 00:09:17 UTC+1 schreef michaeljclark:
> I still like BGATHER, BSCATTER which actually makes sense

The names in the literature are Parallel Extract and Parallel Deposit however I agree Bit Gather and Bit Scatter are more intuitive names, but unfortunately the opcodes are very long and will much up tabs.


BGATH (pronounced as in biblical geneologies :-) ) and  BSCAT then?  
Message has been deleted

Rogier Brussee

unread,
Oct 31, 2017, 9:19:55 AM10/31/17
to RISC-V ISA Dev, rogier....@gmail.com, jcb6...@gmail.com

> I don't see why one would need both a left and right rotate. E.g. RRLI
> rd rs1 shamt  gives exactly the same result as RLLI rd rs1 (XLEN - shamt)

For symmetry with the shift instructions, where both directions are
needed, although I would drop the trailing "L" from the mnemonics --
there is no "arithmetic rotate".

I like symmetry too but that does not seem a good enough reason to me, especially for the immediate version. 
 
>
>     ## Byte swap
>
>     BSWAP[W] rd,rs1                Byte
>     Swap                                        Swap byte order in rs1
>
>     Byte swapping instructions are essential for networking code and
>     cryptographic ciphers which typically use big endian formats. A
>     32-bit byte swap takes ~14 instructions and a 64-bit byte swap
>     takes ~30 instructions.
>
>
>
> Even BSWAPH might be worse it as it is used so extensively in
> networking code.  I guess that could be a GREVI mnemonic too.

This is why I favor GREVI over individual swap/reverse instructions.


I wrote nonsense there I think. Just as BSWAPW rd rs1  could be a pseudo
BSWAP rd rs1; SRLA rd rd -32, BSWAPH could be pseudo for BSWAP rd rs1; SRLA rd rd -16 
(both of which are good targets for fusion) and it does not make a difference whether 
BSWAP is itself a mnemonic for GREVI or not.

-- Jacob

Bruce Hoult

unread,
Oct 31, 2017, 9:22:18 AM10/31/17
to Clifford Wolf, Allen J. Baum, Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
On Tue, Oct 31, 2017 at 3:23 PM, Clifford Wolf <clif...@clifford.at> wrote:
(2) This is the B extension. By definition, the features added by it are
aimed at bit manipulation code and will not be benefiting other code a lot.
So discussing what general non-bitmanip code needs is a strawman.

They should still be useful on most high volume general purpose computing devices. Things useful for, say, secure communications over networks (while minimising power consumption) would I suggest fall under the bar. Things only useful for simulating nuclear weapons or analysing data from medical or geological instruments might not -- those guys can afford to make custom processors.
 
(3) You need log2(XLEN) bits of parameters for a general permutation
operation.

I think you mean XLEN*log2(XLEN) bits: XLEN specifiers (one for each bit position), each specifying which bit of the input should be copied to that bit of the output.

Which is more powerful than simple permutation, as multiple bits can be copied from the same source bit (and other bits not copied anywhere), but whatever.

Perhaps you meant log2(XLEN) words of parameters.

Clifford Wolf

unread,
Oct 31, 2017, 10:16:42 AM10/31/17
to Bruce Hoult, Allen J. Baum, Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
Hi,

On Tue, Oct 31, 2017 at 04:22:16PM +0300, Bruce Hoult wrote:
> > (2) This is the B extension. By definition, the features added by it are
> > aimed at bit manipulation code and will not be benefiting other code a lot.
> > So discussing what general non-bitmanip code needs is a strawman.
>
> They should still be useful on most high volume general purpose computing
> devices. Things useful for, say, secure communications over networks (while
> minimising power consumption) would I suggest fall under the bar. Things
> only useful for simulating nuclear weapons or analysing data from medical
> or geological instruments might not -- those guys can afford to make custom
> processors.

I think you are mixing things here that really have nothing to do with each
other, namely bitmanip/non-bitmanip code (which is what I was talking about)
and concrete application domains that might make use of bitmanip code.

Some bitmanip operations might have applications in "simulating nuclear
weapons or analysing data from medical or geological instruments". But that
does not mean that they are only useful in that one domain! And I seriously
doubt that you will find any opcode in the current B draft spec that can be
connected to just one of those application domains.

I was talking about bit-manipulation vs. non bit-manipulation code. Some
applications domains have more bit-manipulation code than others. If you
build HW specifically for a domain that need bit-manipulation code then
just don't implement the B extension.

So your point is a red herring and has nothing to do with the issues being
discussed.

If you disagree then you should actually name the specific instruction from
the draft B spec that you think is only useful in a narrow application
domain. Then we can discuss that specific claim. Otherwise this is just
irrelevant noise.

> > (3) You need log2(XLEN) bits of parameters for a general permutation
> > operation.
>
> Perhaps you meant log2(XLEN) words of parameters.

yes. That's what I meant.

> Which is more powerful than simple permutation, as multiple bits can be
> copied from the same source bit (and other bits not copied anywhere), but
> whatever.

yes, for a 32 bit permutation you'd need 117.663 bits [1] instead of 160
bits. but I would not know of any good encoding in say max 128 bits that
is easy to generate and easy to process in hardware. The 160 bit encoding
is trivial to process: simply use XLEN XLEN-to-1 muxes.

Please let me know if you know of a practical encoding that would get rid
of one parameter word. I'm always interested in stuff like that.

regards,
- clifford

[1] Result of "np.sum([np.log2(n+1) for n in range(32)])". I hope that's
correct.. :)

Rogier Brussee

unread,
Oct 31, 2017, 10:39:51 AM10/31/17
to RISC-V ISA Dev, clif...@clifford.at, allen...@esperantotech.com, michae...@mac.com, jcb6...@gmail.com, br...@hoult.org


Op dinsdag 31 oktober 2017 14:22:18 UTC+1 schreef Bruce Hoult:
On Tue, Oct 31, 2017 at 3:23 PM, Clifford Wolf <clif...@clifford.at> wrote:
(2) This is the B extension. By definition, the features added by it are
aimed at bit manipulation code and will not be benefiting other code a lot.
So discussing what general non-bitmanip code needs is a strawman.

They should still be useful on most high volume general purpose computing devices. 
Things useful for, say, secure communications over networks (while minimising power consumption) would I suggest fall under the bar.

F.W.I.W.  I mostly agree but the problem is of course that everybody thinks its own applications are super important. 
I would also add that things that are common in embedded use would seem reasonable.  Last reason I can think of is that it opens
up useful opportunities for fusing instructions and the instructions themselves are cheap.

So I think it would be really useful to have a list of important applications going with the instructions and alternatives to cater for
these instructions! Here is my attempt using my memory of what I have seen pass by and happen to know. YMMV.


Opcode                                application(s)                                       Alternatives 
RLL[W] rd,rs1,rs2                 Cryptography, fusion                                
RRL[W] rd,rs1,rs2                 same                                                   ADDI rd, rs2, -XLEN; RRL[w] rd, rs1, rd                          
RLLI[W] rd,rs1,shamt            same 
RRLI[W] rd,rs1,shamt           same                                                   RLLI rd, rs1, (XLEN -shamt)
BCLZ[W] rd,rs1                     softfloat                                               
BCTZ[W] rd,rs1                     ????                                                    BREV rd rs1; BCLZ rd, rd                                             
BCNT[W] rd,rs1                     malloc, bitmaps
BREV[W] rd,rs1                     reverse addresses, bitmaps??            GREVI rd, (XLEN -1)
BSWAP[W] rd,rs1                  be <-> le e.g. networking,                   GREVI rd, (XLEN - 8) (?)

GREVI[W] rd rs1 shamt        see BSWAP, BREV bitpermutations   BSWAP, BREV  instructions for common case, trapping for others     
BGATH[W] rd,rs1,rs2            instruction decoding in traps, bitmaps  masking and shifting, instructions or csr's that directly "parse" the trapping instruction into (instr, rd, rs1, rs2, rs3, imm/shamt)??????. 
BSCAT[W] rd,rs1,rs2            bitmaps                                                 masking and shifting.
 
ANDNOT rd rs1 rs2             clear bit, fusion                                      XORI rd, rs2, -1; AND rd, rd, rs1,  (possibly extending C extension with not instruction)
ORNOT    rd rs1 rs2             masking, fusion                                    XORI rd, rs2, -1; OR rd, rd, rs1,     
SLLM        rd rs1 rs2           constructing masks, fusion                    XORI rd, rs1, -1; SLL rd, rd, rs2; XORI rd, rd, -1
SLLMI       rd rs1 shamt       same                                                     XORI rd, rs1, -1; SLLI rd, rd, shamt; XORI rd, rd, -1




 

Bruce Hoult

unread,
Oct 31, 2017, 10:56:32 AM10/31/17
to Clifford Wolf, Allen J. Baum, Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
On Tue, Oct 31, 2017 at 5:16 PM, Clifford Wolf <clif...@clifford.at> wrote:
Hi,

On Tue, Oct 31, 2017 at 04:22:16PM +0300, Bruce Hoult wrote:
> > (2) This is the B extension. By definition, the features added by it are
> > aimed at bit manipulation code and will not be benefiting other code a lot.
> > So discussing what general non-bitmanip code needs is a strawman.
>
> They should still be useful on most high volume general purpose computing
> devices. Things useful for, say, secure communications over networks (while
> minimising power consumption) would I suggest fall under the bar. Things
> only useful for simulating nuclear weapons or analysing data from medical
> or geological instruments might not -- those guys can afford to make custom
> processors.

I think you are mixing things here that really have nothing to do with each
other, namely bitmanip/non-bitmanip code (which is what I was talking about)
and concrete application domains that might make use of bitmanip code.

I disagree.

If some specialised device doesn't do bit manipulation at all (or hardly at all) then of course it doesn't need to implement the B extension. Of course.

My point (one of them) is that anything even general purpose enough to do secure communications over a network (e.g. IoT) is going to want the performance but especially the power consumption benefits of bit manipulation instructions useful for, say, network byte order reversal (RISC-V being LE), AES or similar, compression...

 
Some bitmanip operations might have applications in "simulating nuclear
weapons or analysing data from medical or geological instruments". But that
does not mean that they are only useful in that one domain!

Of course. And others might be. I don't know what is useful in those domains, or useful only in those domains. Quite possibly there is something they'd like but no one else needs. It's only a hypothetical example. Let's hypothetically say that is the case. It shouldn't be in B. I'm not sure how that is controversial.
 
And I seriously
doubt that you will find any opcode in the current B draft spec that can be
connected to just one of those application domains.

The draft spec is not publicly available, as far as I'm aware. That's been an ongoing topic in this thread.
 
I was talking about bit-manipulation vs. non bit-manipulation code. Some
applications domains have more bit-manipulation code than others. If you
build HW specifically for a domain that need bit-manipulation code then
just don't implement the B extension.

I assume you mean "that doesn't need". In which case: obviously. But if you need to communicate then you'd probably benefit from bit manipulation instructions. 


So your point is a red herring and has nothing to do with the issues being
discussed.

If you disagree then you should actually name the specific instruction from
the draft B spec that you think is only useful in a narrow application
domain. Then we can discuss that specific claim. Otherwise this is just
irrelevant noise.

We (or at least I) don't have access to the draft B spec.

Clifford Wolf

unread,
Oct 31, 2017, 11:43:45 AM10/31/17
to Bruce Hoult, Allen J. Baum, Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
On Tue, Oct 31, 2017 at 05:56:30PM +0300, Bruce Hoult wrote:
> > If you disagree then you should actually name the specific instruction from
> > the draft B spec that you think is only useful in a narrow application
> > domain. Then we can discuss that specific claim. Otherwise this is just
> > irrelevant noise.
>
> We (or at least I) don't have access to the draft B spec.

I'm sure your work on RV8 would qualify you for a free individual
membership in the RISC-V Foundation that would allow you to join the B
extension work group.

Allen J. Baum

unread,
Oct 31, 2017, 1:15:25 PM10/31/17
to Clifford Wolf, Michael Clark, jcb6...@gmail.com, RISC-V ISA Dev
I think most, if not all, of your criticisms are valid.
There are assumptions I was making which are quite different than yours- which isn't to say my assumptions are correct.
e.g. For #3, I was assuming  known fixed permutations (as used in, say AES) rather than arbitrary permutations. Similar assumptions are in #4 (I assumed full width always as being the most common)

Looking (now) at your table - it isn't clear to me how many cycles each core takes, and what the latency of each is - could you add those to the table?
>> Synthetic biology is what immediately comes to mind (nucleotide sequences require radix=4) but there are likely many potential future applications·

>
> Again: if this a really important app, then a synthetic biology coprocessor would make more sense.

Not how most of the high performance computing is done today.. In most
cases one will have to make do with what commodity processors provide.

regards,
 - clifford

Allen J. Baum

unread,
Oct 31, 2017, 1:28:51 PM10/31/17
to Bruce Hoult, Clifford Wolf, Michael Clark, Jacob Bachmeyer, RISC-V ISA Dev
At 5:56 PM +0300 10/31/17, Bruce Hoult wrote:
>...
>If some specialised device doesn't do bit manipulation at all (or hardly at all) then of course it doesn't need to implement the B extension. Of course.
>
>My point (one of them) is that anything even general purpose enough to do secure communications over a network (e.g. IoT) is going to want the performance but especially the power consumption benefits of bit manipulation instructions useful for, say, network byte order reversal (RISC-V being LE), AES or similar, compression...


My take on this is that something like IOT that care about power will most likely implement byte order reversal, and AES, in HW, possibly in the IO interface.

For specialized apps, specialized HW is the way to go.

That does not preclude the discussion of bit manipulation instructions, just a design point to consider.

Clifford Wolf

unread,
Oct 31, 2017, 2:28:35 PM10/31/17
to Allen J. Baum, Michael Clark, jcb6...@gmail.com, RISC-V ISA Dev
Hi,

On Tue, Oct 31, 2017 at 10:15:18AM -0700, Allen J. Baum wrote:
> Looking (now) at your table - it isn't clear to me how many cycles
> each core takes, and what the latency of each is - could you add
> those to the table?

Just to make sure we are on the same page, we are talking about the tables
on https://github.com/cliffordwolf/bextdep/

The initiation interval (II) of the "pipeline" and "single-stage" cores is 1,
meaning that it can compute one BEXT, BDEP, or GREV instruction every cycle.
The pipeline depth (1 for "single stage", 2 or 3 for "2-stage pipeline" and
"3-stage pipeline") is the number of cycles between reading a job from the
input stream and making the results available on the output stream.

The II of the "4-cycles", "8-cycles", and "16-cycles" cores is 5, 9, and
17, as stated in the text. The "up-to-*-cycles" versions have an II one
larger then the numbers of set bits in the mask.

I'm not sure if you mean "latency" in terms of clock cycles between
consuming input and making output available, or in term of max. path
delay.

The "single-stage" cores make the results available in the next cycle, i.e.
there is exactly one FF stage between input and output. The 2 and 3 stage
versions have two or three FF stages between input and output.

For the sequential cores the latency is equal to the II.

For the FPGA evaluations for Xilinx 7-series you can calculate the max path
delay by calculating the reciprocal value of "Max Freq.".

For the ASIC evaluations I provide path delay information in terms of
number of gate delays. (The "Delays" column in the table.) You'd have to
multiply that by an estimated gate delay time for your target fab process.

(The ASIC evaluations have been done with the tools I wrote. I have no idea
how this gate counts compare to the results produced by the commercial
tools since I don't have access to them.)

If you have a concrete fab process (and the design tools for it), then you
can simply synthesize the cores for that process with your tools to get
the numbers. The Verilog design files are bextdep.v and bextdep_pps.v in
that directory.

The Makefile also contains make targets to run pre- and post-synthesis
simulation of all the cores (using Icarus Verilog as simulator). E.g.
run "make test32g1" to run the test bench for the "bextdep32g1" core. This
will also generate a "testbench.vcd" file.

regards,
- clifford

Michael Clark

unread,
Oct 31, 2017, 3:19:27 PM10/31/17
to Clifford Wolf, Bruce Hoult, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev
I joined one of the earlier BitManip conference calls and there seemed to be consensus to share progress with the RISC-V community at large and some work was done to create a status update that I thought was going to be shared with the list, however there doesn’t seem to have been any updates shared with the RISC-V community at large. I was not a member of the RISC-V workgroup system BitManip group so was not aware of any more recent ongoing work however I am quite interested in BitManip instructions as RV8 could benefit from a compiler branch that supports emitting a draft B extension. They are easy to add in binary translation as many of these instructions are available on many architectures.

As I mentioned in this RFC I’m personally quite interested in the first 9 of the 12 instructions that I shared independently from the BitManip group because a fair amount of code uses their pre-existing compiler intrinsics or lifts (BSWAP/ROTATE are detected and lifted by both GCC/Clang and the other 4 have compiler builtins). I have some benchmarks that use these intrinsics on RISC-V presently but they generate very inefficient code (e.g. RV64 clz on int32 currently calls code with additional canonicalisation and a bug that performs 4 redundant loop iterations as there is only a 64-bit versions of the clz intrinsic in RV64 libgcc). The first set of 9 is what I would call an essential set based on the principles of RISC, that only a small number of instructions in an ISA are actually used by compilers. While GREVI/BEXT/BDEP are interesting, there is not a lot of existing code that uses them. We shouldn’t necessarily use that as a reason to exclude them if we believe there are potential applications that could benefit from them.

These 9 have bodies of code that use pre-existing compiler builtins or lifts.

RLL[W] rd,rs1,rs2 Rotate Left Logical Rotate bits in rs1 left by the amount in rs2
RRL[W] rd,rs1,rs2 Rotate Right Logical Rotate bits in rs1 right by the amount in rs2
RLLI[W] rd,rs1,shamt Rotate Left Logical Immediate Rotate bits in rs1 left by the immediate
RRLI[W] rd,rs1,shamt Rotate Right Logical Immediate Rotate bits in rs1 right by the immediate
BCLZ[W] rd,rs1 Bit Count Leading Zeros Count leading zero bits in rs1
BCTZ[W] rd,rs1 Bit Count Trailing Zeros Count trailing zero bits in rs1
BCNT[W] rd,rs1 Bit Count Count number of bits set in rs1
BSWAP[W] rd,rs1 Byte Swap Swap byte order in rs1
BREV[W] rd,rs1 Bit Reverse Reverse bits in rs1

These 3 are a little more obscure however their potential usefulness is not disputed. They could potentially be included in a profile like M minus DIV or FD minus FDIV/FSQRT. i.e. present on application processors, but not present in embedded processors.

Michael Clark

unread,
Oct 31, 2017, 3:34:43 PM10/31/17
to Rogier Brussee, RISC-V ISA Dev, Clifford Wolf, Allen Baum, jcb6...@gmail.com, br...@hoult.org


> On 1/11/2017, at 3:39 AM, Rogier Brussee <rogier....@gmail.com> wrote:
>
>
>
> Op dinsdag 31 oktober 2017 14:22:18 UTC+1 schreef Bruce Hoult:
> On Tue, Oct 31, 2017 at 3:23 PM, Clifford Wolf <clif...@clifford.at> wrote:
> (2) This is the B extension. By definition, the features added by it are
> aimed at bit manipulation code and will not be benefiting other code a lot.
> So discussing what general non-bitmanip code needs is a strawman.
>
> They should still be useful on most high volume general purpose computing devices.
> Things useful for, say, secure communications over networks (while minimising power consumption) would I suggest fall under the bar.
>
> F.W.I.W. I mostly agree but the problem is of course that everybody thinks its own applications are super important.
> I would also add that things that are common in embedded use would seem reasonable. Last reason I can think of is that it opens
> up useful opportunities for fusing instructions and the instructions themselves are cheap.
>
> So I think it would be really useful to have a list of important applications going with the instructions and alternatives to cater for
> these instructions! Here is my attempt using my memory of what I have seen pass by and happen to know. YMMV.
>
>
> Opcode application(s) Alternatives
> RLL[W] rd,rs1,rs2 Cryptography, fusion
> RRL[W] rd,rs1,rs2 same ADDI rd, rs2, -XLEN; RRL[w] rd, rs1, rd
> RLLI[W] rd,rs1,shamt same
> RRLI[W] rd,rs1,shamt same RLLI rd, rs1, (XLEN -shamt)
> BCLZ[W] rd,rs1 softfloat
> BCTZ[W] rd,rs1 ???? BREV rd rs1; BCLZ rd, rd

Count Trailing Zeros is used frequently for a fast log2 of an integer. I come across it relatively often.

> BCNT[W] rd,rs1 malloc, bitmaps
> BREV[W] rd,rs1 reverse addresses, bitmaps?? GREVI rd, (XLEN -1)
> BSWAP[W] rd,rs1 be <-> le e.g. networking, GREVI rd, (XLEN - 8) (?)
>
> GREVI[W] rd rs1 shamt see BSWAP, BREV bitpermutations BSWAP, BREV instructions for common case, trapping for others
> BGATH[W] rd,rs1,rs2 instruction decoding in traps, bitmaps masking and shifting, instructions or csr's that directly "parse" the trapping instruction into (instr, rd, rs1, rs2, rs3, imm/shamt)??????.
> BSCAT[W] rd,rs1,rs2 bitmaps masking and shifting.
>
> ANDNOT rd rs1 rs2 clear bit, fusion XORI rd, rs2, -1; AND rd, rd, rs1, (possibly extending C extension with not instruction)
> ORNOT rd rs1 rs2 masking, fusion XORI rd, rs2, -1; OR rd, rd, rs1,
> SLLM rd rs1 rs2 constructing masks, fusion XORI rd, rs1, -1; SLL rd, rd, rs2; XORI rd, rd, -1
> SLLMI rd rs1 shamt same XORI rd, rs1, -1; SLLI rd, rd, shamt; XORI rd, rd, -1
>
>
>
>
>
>
>
> --
> You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
> To post to this group, send email to isa...@groups.riscv.org.
> Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
> To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/44446493-0eba-46d7-97b7-26058b04a8e5%40groups.riscv.org.

Michael Clark

unread,
Oct 31, 2017, 3:40:08 PM10/31/17
to Rogier Brussee, RISC-V ISA Dev, Clifford Wolf, Allen Baum, jcb6...@gmail.com, br...@hoult.org


> On 1/11/2017, at 8:34 AM, Michael Clark <michae...@mac.com> wrote:
>
>> On 1/11/2017, at 3:39 AM, Rogier Brussee <rogier....@gmail.com> wrote:
>>
>> BCTZ[W] rd,rs1 ???? BREV rd rs1; BCLZ rd, rd
>
> Count Trailing Zeros is used frequently for a fast log2 of an integer. I come across it relatively often.

Log2 of the lowest set bit to be more precise or used under the constraint that the input is a power of two, but you get what I mean. e.g.

assert(n == (n & -n)); /* is power of two */
int log2n = ctz(n);

Tommy Thorn

unread,
Oct 31, 2017, 5:34:46 PM10/31/17
to Michael Clark, Rogier Brussee, RISC-V ISA Dev, Clifford Wolf, Allen Baum, Jacob Bachmeyer, Bruce Hoult
TL;DR: quantify quantify quantify

This is a long thread to comment on, but I'd like to weigh in on a
fundamental point:

(2) This is the B extension. By definition, the features added by it are
aimed at bit manipulation code and will not be benefiting other code a lot.
So discussing what general non-bitmanip code needs is a strawman.

It's clear that there is a significant disagreement over the criterium
for what goes in B.  It's not helpful for any individual to assume that
his criterium is the same as that of everyone else.  I don't agree
that a strawman is what's most urgently needed.

The notion of "other code" is fundamentally subjective.  The way out
of this rat hole is to be very explicit about the workloads and quantify
the performance impact on those (obviously more work than just arguing
for a preference).  Some random examples: Linux kernel, SPEC benchmark,
ssh, OpenSSL, "TLS 1.3", x264, Mozilla Firefox, etc.  These might not be
the best examples, but they are examples of software for which the
number of cycles spent over the number of users is significant enough.

It should be relatively easy to quantify, fx. the bandwidth impact on
ssh of having rotates (chacha20-poly1305 needs them) and it's
far more likely to get implementors to accept a rotate instruction
over a full AES accelerator.

Similarly, it ought to be feasible to quantify the bandwidth impact
on the Linux stack of having byte swap instructions or not.

It's not hard to find applications of the CTZ/CLZ/POPCOUNT instructions,
but I have yet to see actual data from mainstream application
of the impact of these instructions. Nothing is free, and even
application that might use them might still be worse off with
them if it has any impact on cycle time/frequency.

Finally, I'd love to be educated on the bit scatter/gather instructions.
Can anyone point to any mainstream application that
would actually benefit from this?  Without a significant
rewrite?

Thanks,
Tommy

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.

To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

McCrary, Rex

unread,
Oct 31, 2017, 6:23:13 PM10/31/17
to Tommy Thorn, Michael Clark, Rogier Brussee, RISC-V ISA Dev, Clifford Wolf, Allen Baum, Jacob Bachmeyer, Bruce Hoult

Everybody take a deep breath….  Good….   I have been traveling on business out of the country, so I am behind on getting the next  Draft of the spec out.  It will not take into account all the discussion the last four days, but the next one will in time for the workshop. 

 

--Rex

To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.

 

--

You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.

To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.


To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

Rogier Brussee

unread,
Nov 1, 2017, 11:51:57 AM11/1/17
to RISC-V ISA Dev, michae...@mac.com, rogier....@gmail.com, clif...@clifford.at, allen...@esperantotech.com, jcb6...@gmail.com, br...@hoult.org


Op dinsdag 31 oktober 2017 22:34:46 UTC+1 schreef Tommy Thorn:
TL;DR: quantify quantify quantify

for a preference).  Some random examples: Linux kernel, SPEC benchmark,
ssh, OpenSSL, "TLS 1.3", x264, Mozilla Firefox, etc.  These might not be
the best examples, but they are examples of software for which the
number of cycles spent over the number of users is significant enough.

Very much agreed. I think that in fact knowing how often the bit instructions on x86 
and/or arm are used statically and dynamically should actually already give
useful approximation of the importance of different instructions.  

 
On Tue, Oct 31, 2017 at 12:39 PM, Michael Clark <michae...@mac.com> wrote:


> On 1/11/2017, at 8:34 AM, Michael Clark <michae...@mac.com> wrote:
>
>> On 1/11/2017, at 3:39 AM, Rogier Brussee <rogier....@gmail.com> wrote:
>>
>> BCTZ[W] rd,rs1                     ????                                                    BREV rd rs1; BCLZ rd, rd
>
> Count Trailing Zeros is used frequently for a fast log2 of an integer. I come across it relatively often.

Log2 of the lowest set bit to be more precise or used under the constraint that the input is a power of two, but you get what I mean. e.g.

        assert(n == (n & -n)); /* is power of two */
        int log2n = ctz(n);

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.

To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

Corey Richardson

unread,
Nov 1, 2017, 3:33:40 PM11/1/17
to isa...@groups.riscv.org
FWIW, ~all serious chess engines benefit heavily from CLZ/POPCNT, at at
least stockfish uses PEXT to good effect. Might be a decent benchmark?
Certainly does a lot of bit manipulation :)

--
cmr
http://octayn.net/
+16038524272

Jacob Bachmeyer

unread,
Nov 1, 2017, 7:06:47 PM11/1/17
to Rogier Brussee, RISC-V ISA Dev, clif...@clifford.at, allen...@esperantotech.com, michae...@mac.com, br...@hoult.org
Four mask shifts were proposed: SLM (shift left mask), SLMI (shift left
mask immediate), SRM (shift right mask), SRMI (shift right mask
immediate). The symmetry combines with the proposed encoding (set bit29
to select mask shift) to minimize implementation cost (replace hardwired
zero with bit29 in the shifter). I also proposed encoding rotate as
setting both bits 29 and 30 (since "arithmetic mask shift" is otherwise
nonsensical) and, more recently, setting bit30 on SLL/SLLI to encode
GREV/GREVI, since "arithmetic left shift" is otherwise nonsensical and
GREV nicely fits that corner of the encoding space, leaving bits 28 and
27 for selecting future extensions, thus neatly packing existing
instructions around shift into exactly 1/4 of the available space (bit28
== bit27 == zero). (That GREV's "k" parameter has the same domain as
shift width helps here.)


-- Jacob

Jacob Bachmeyer

unread,
Nov 1, 2017, 7:11:20 PM11/1/17
to Rogier Brussee, RISC-V ISA Dev, allen...@esperantotech.com, clif...@clifford.at
Or one letter fewer: BGTH and BSCT, which fits with the other
four-letter mnemonics.


-- Jacob

Jacob Bachmeyer

unread,
Nov 1, 2017, 7:16:50 PM11/1/17
to Clifford Wolf, Michael Clark, Allen Baum, RISC-V ISA Dev
Apologies if I came off gruff, but this thread started with Michael
Clark's complaint that there have been no public updates from the
BitManip group.


-- Jacob

Allen J. Baum

unread,
Nov 1, 2017, 7:42:31 PM11/1/17
to jcb6...@gmail.com, Rogier Brussee, RISC-V ISA Dev, clif...@clifford.at, michae...@mac.com, br...@hoult.org
At 6:06 PM -0500 11/1/17, Jacob Bachmeyer wrote:
>>...Opcode application(s)
>>
>>RLL[W] rd,rs1,rs2 Cryptography, fusion
>>RRL[W] rd,rs1,rs2
>>RLLI[W] rd,rs1,shamt
>>BCLZ[W] rd,rs1 softfloat
>>BCTZ[W] rd,rs1 ????
>>BCNT[W] rd,rs1 malloc, bitmaps
>>BREV[W] rd,rs1 reverse addresses, bitmaps??
>>BSWAP[W] rd,rs1 be <-> le e.g. networking,
>>GREVI[W] rd rs1 shamt see BSWAP, BREV bitpermutations
>>BGATH[W] rd,rs1,rs2 instruction decoding in traps, bitmaps masking and shifting,...
>>BSCAT[W] rd,rs1,rs2 bitmaps
>>ANDNOT rd rs1 rs2 clear bit, fusion
>>ORNOT rd rs1 rs2 masking, fusion
>>SLLM rd rs1 rs2 constructing masks, fusion
>>SLLMI rd rs1 shamt same
>
>..details about encoding...

I'm a little sceptical about the utility of a bunch of these
ORNOT - I can see an easy usecase for ANDNOT, but except for orthgonality, can't see a use case
GREVI - I can see easy, nearly free implementations for cases where Imm=2^N, but not much beyond that.
I think they could be replaced either with the BSWAP/BREV/BREVH ops, or a GREVI where the
Immediate is restricted to the form 2^N (and traps for other values) amd removing the Bxxx versions.
I'm also a little sceptical about the form of the SLLMx ops.
The typical use case for masking would have rs1==0; restricing it to that would cut its opcode footprint.
Can anyone come up with cases that are different? Other cases have an easy instruction sequence replacement
SLL rd2,rs1,IMM; SLLM rd,imm; OR rd, rd2, rd
By and large, these are tweaks, but the effort saved adds up in terms of power/area/ validation time! and future expandibility
Can anyone come up with use cases beyond field isolation and muxing that could using SLLMx with a non-zero source?

Jacob Bachmeyer

unread,
Nov 1, 2017, 11:02:36 PM11/1/17
to Allen J. Baum, Rogier Brussee, RISC-V ISA Dev, clif...@clifford.at, michae...@mac.com, br...@hoult.org
Allen J. Baum wrote:
> At 6:06 PM -0500 11/1/17, Jacob Bachmeyer wrote:
>
>>> ...Opcode application(s)
>>>
>>> RLL[W] rd,rs1,rs2 Cryptography, fusion
>>> RRL[W] rd,rs1,rs2
>>> RLLI[W] rd,rs1,shamt
>>> BCLZ[W] rd,rs1 softfloat
>>> BCTZ[W] rd,rs1 ????
>>> BCNT[W] rd,rs1 malloc, bitmaps
>>> BREV[W] rd,rs1 reverse addresses, bitmaps??
>>> BSWAP[W] rd,rs1 be <-> le e.g. networking,
>>> GREVI[W] rd rs1 shamt see BSWAP, BREV bitpermutations
>>> BGATH[W] rd,rs1,rs2 instruction decoding in traps, bitmaps masking and shifting,...
>>> BSCAT[W] rd,rs1,rs2 bitmaps
>>> ANDNOT rd rs1 rs2 clear bit, fusion
>>> ORNOT rd rs1 rs2 masking, fusion
>>> SLLM rd rs1 rs2 constructing masks, fusion
>>> SLLMI rd rs1 shamt same
>>>
>> ..details about encoding...
>>
>
> I'm a little sceptical about the utility of a bunch of these
> ORNOT - I can see an easy usecase for ANDNOT, but except for orthgonality, can't see a use case
>

The hardware cost for ORNOT should be included in the cost of ANDNOT, so
why not have the orthogonality?

> GREVI - I can see easy, nearly free implementations for cases where Imm=2^N, but not much beyond that.
> I think they could be replaced either with the BSWAP/BREV/BREVH ops, or a GREVI where the
> Immediate is restricted to the form 2^N (and traps for other values) amd removing the Bxxx versions.
>

I suggest that BREV/BSWAP* be pseudoinstructions for GREVI with
particular immediate values and that implementations be allowed to
subset any part of GREVI in hardware and rely on supervisor or monitor
traps, provided that such reliance is documented.

> I'm also a little sceptical about the form of the SLLMx ops.
> The typical use case for masking would have rs1==0; restricing it to that would cut its opcode footprint.
> Can anyone come up with cases that are different? Other cases have an easy instruction sequence replacement
> SLL rd2,rs1,IMM; SLLM rd,imm; OR rd, rd2, rd
> By and large, these are tweaks, but the effort saved adds up in terms of power/area/ validation time! and future expandibility
> Can anyone come up with use cases beyond field isolation and muxing that could using SLLMx with a non-zero source?
>

My argument for the full orthogonal set of SLM/SLMI/SRM/SRMI is that I
expect the full set to be simpler to implement than any subset -- the
proposed encoding allows simply using bit29 instead of hardwired zero in
the shifter. (Which is the reason I proposed that encoding.)



-- Jacob

Michael Clark

unread,
Nov 2, 2017, 12:17:04 AM11/2/17
to jcb6...@gmail.com, Clifford Wolf, Allen Baum, RISC-V ISA Dev
Well I don’t think I was exactly complaining. Apologies if it appeared that way. It is up to the BitManip group as to how they wish to communicate with the RISC-V community at large however it would be nice for those not directly involved in the working group to get some progress updates and open discussion can often be fruitful even if it can appear somewhat noisy. I’m certainly a little wiser after seeing how GREVI can be used to implement BREV and BSWAP pseudos with different k values.

I guess I was primarily posting an independent RFC on ISA dev for a proposed set of BitManip instructions, primarily the ones that the compiler already can emit (through builtins and lifts for rotate/bswap) so that I could perform some of my own analysis on retired instruction counts for some benchmarks I am working with. I had included 2 additional instructions that I had seen Clifford post on Twitter about (BEXT/BDEP) and then I finally added GREVI to the list I am independely maintaining after understanding how GREVI can be used. This page has been up since Oct 2016:

- https://github.com/rv8-io/rv8/commits/master/doc/src/bmi.md

I’m most interested in the first 9 instructions in the list I posted earlier as they have immediate utility given there are existing compiler builtins or lifts and there is a fair amount of existing code that can use them immediately. I’m really mostly interested in whether someone already has a draft encoding and patched binutils/gcc so we can gather some concrete stats. It’s easier to see how these instructions affect performance by adding them to the compiler and running tests, which means we have to include instructions that are being considered (including W variants on RV64).

I still maintain that it would be nice to keep the “work done per instruction” as high as possible. This is not necessarily at odds with RISC. I think width is a reasonable thing to encode given its available on competing architectures. It’s a very small edge nevertheless it is an edge. Many small edges combine together to give one a performance advantage/disadvantage. It would be interesting to quantify omitting W variants on RV64.

Allen J. Baum

unread,
Nov 2, 2017, 1:10:09 PM11/2/17
to jcb6...@gmail.com, Rogier Brussee, RISC-V ISA Dev, clif...@clifford.at, michae...@mac.com, br...@hoult.org
At 10:02 PM -0500 11/1/17, Jacob Bachmeyer wrote:
>
>The hardware cost for ORNOT should be included in the cost of ANDNOT, so why not have the orthogonality?

The hardware cost is likely inconsequential - we agree.
It's all the second-order costs: primarily validation, compiler support costs, testing costs, documentation costs... Every little bit hurts, more than you think, and its on ongoing cost that never goes away.
But also opportunity costs: yet another opcode that can't be used in the future - and everything you add is something you'll need to support forever, even if it turns out to be dead weight.


>>GREVI - I can see easy, nearly free implementations for cases where Imm=2^N, but not much beyond that.
>> I think they could be replaced either with the BSWAP/BREV/BREVH ops, or a GREVI where the
>> Immediate is restricted to the form 2^N (and traps for other values) amd removing the Bxxx versions.
>>
>
>I suggest that BREV/BSWAP* be pseudoinstructions for GREVI with particular immediate values and that implementations be allowed to subset any part of GREVI in hardware and rely on supervisor or monitor traps, provided that such reliance is documented.

I could live with that.

>
>>I'm also a little sceptical about the form of the SLLMx ops.
>>The typical use case for masking would have rs1==0; restricing it to that would cut its opcode footprint.
>>Can anyone come up with cases that are different? Other cases have an easy instruction sequence replacement
>> SLL rd2,rs1,IMM; SLLM rd,imm; OR rd, rd2, rd
>>By and large, these are tweaks, but the effort saved adds up in terms of power/area/ validation time! and future expandibility
>>Can anyone come up with use cases beyond field isolation and muxing that could using SLLMx with a non-zero source?
>>
>
>My argument for the full orthogonal set of SLM/SLMI/SRM/SRMI is that I expect the full set to be simpler to implement than any subset -- the proposed encoding allows simply using bit29 instead of hardwired zero in the shifter. (Which is the reason I proposed that encoding.)

Huh - I missed the SRx variants somehow when I reformatted the list.
I was surprised they weren't there, and was going to propose them.
I agree the ops would be useful - but only the R0 variant, which I believe will be the overwhelming majority of cases used - for the same reasons as I would not implement ORNOT above. The cost of encoding either way is inconsequential.
I'd rename them to be MSKGENL and MSKGENR (someone else can hcoose which letters to delete to keep the mnemonic down to 4 characters)

>
>
>
>-- Jacob

Michael Clark

unread,
Nov 2, 2017, 1:37:50 PM11/2/17
to jcb6...@gmail.com, Clifford Wolf, RISC-V ISA Dev


> On 30/10/2017, at 4:18 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Michael Clark wrote:
>> Note to Jacob: the diagram is for an “area efficient” implementation, not a high performance implementation. This is to ameliorate the concern of folk who are area constrained but want to implement B. A high performance implementation could have dedicated circuits for CLZ/CTZ/POPCOUNT if critical path length is an issue. Having separate opcodes is obviously a necessity for high performance implementations as it doesn’t pre-suppose any particular implementation approach, however it is good to know that an area efficient approach that re-uses functional units is possible.
>>
>
> :-)
>
> There are other ALU topologies possible as well. Your earlier message just seemed cavalier about trading area for critical path length.

Hardly.

A barrel shifter is by nature xlen^2 in area and xlen in depth.

The block structure I proposed is ~ 2log2(xlen) in depth; given grevi and popcount are log2(xlen) in depth plus the and+adder (set trailing zeros to ones) which is negligible:

- https://github.com/rv8-io/rv8/blob/master/doc/src/bitmanip.png

I think it is an elegant solution. It would need a constant offset shifter (depth 1) and some masking to handle W variants. Much less power hungry than using the shifter to perform a constant offset shift. Given dark silicon constraints, it would be provably lower energy to use additional area to implement the W variants.

Bruce Hoult

unread,
Nov 2, 2017, 2:47:26 PM11/2/17
to Michael Clark, Jacob Bachmeyer, Clifford Wolf, RISC-V ISA Dev
On Thu, Nov 2, 2017 at 8:37 PM, Michael Clark <michae...@mac.com> wrote:
A barrel shifter is by nature xlen^2 in area and xlen in depth.

Wait, what?

A shift/rotate unit with constant delay is normally done by a stage that can optionally shift/rotate by 1 position, then a stage that can shift by 2 bits, then a stage that can shift by 4 bits, then stages that can shift by 8, 16, 32, 64 bits...

Each stage is directly controlled (pass through unchange, or shift) by one bit from the shift amount.

The delay is proportional to log(xlen) and the area to xlen*log(xlen).

Michael Clark

unread,
Nov 2, 2017, 3:14:24 PM11/2/17
to Bruce Hoult, Jacob Bachmeyer, Clifford Wolf, RISC-V ISA Dev
My bad. I was thinking of a cross-bar barrel shifter vs a cascade shifter.

So the delay of GREVI+POPCOUNT would be approximately double that of a cascade shifter.

Michael Clark

unread,
Nov 2, 2017, 4:05:35 PM11/2/17
to Bruce Hoult, Jacob Bachmeyer, Clifford Wolf, RISC-V ISA Dev
Considerations for small area implementations aside…

If critical path is an issue i.e. performance (assuming the Bitmanip unit is pipelined), then an ISA that defines individual opcodes and widths is going to have the most performance potential, with the presumption that only the most sophisticated micro-architectures can perform macro-op fusion.

Jacob is saying on one hand that critical path is an issue and on the other a design that mandates additional shift instructions is acceptable. I’m looking at real-life code and I see BSWAP32/CTZ32/CLZ32 on 32-bit integers on RV64. People that write code use int often enough. intptr_t is a relatively new type and a lot of this has only transpired over the last fifteen years on Linux since the big migration from 32-bit to 64-bit. A lot of code that runs on 64-bit is still optimised for 32-bit and does not have the additional level of complexity of using intptr_t and adjusting stride. In many cases the logical quantum is 32-bits.

Handling width in the bitmanip functional unit will be an order of magnitude simpler than implementing macro-op fusion. Not encoding W variants punishes all RV64 implementations without macro-op fusion.

In fact I’m fine with macro-op fusion for the binary translation case as we can fuse them because the targets we are translating to have both 32-bit and 64-bit variants. So I don’t mind, however we should define the W instructions as macro-op fusion pairs if we go that way, so that the compiler can schedule them appropriately. Also the second instruction needs to kill the temporarily shifted result i.e. use the result register to hold the shifted temporary.

The question is really whether RV64 implementors want maximal work per instruction/micro-op at the expense of 1 bit of encoding space for W variants. A sophisticated implementation with macro-op fusion is going to have a distinct micro-op either way.

In any case, it would be unwise to not include W variants in the study set of instructions. so that we can quantity retired instruction counts. From my own experience, shaving cycles out of a critical loop result in measurable performance improvements. The optimisation metaphor is “death by a thousand cuts” for want of a better metaphor [1]. i.e. the result of many small optimisations add up.

[1] https://en.wikipedia.org/wiki/Lingchi

kr...@berkeley.edu

unread,
Nov 2, 2017, 4:44:25 PM11/2/17
to Bruce Hoult, Michael Clark, Jacob Bachmeyer, Clifford Wolf, RISC-V ISA Dev
No - you didn't count the wires. It's quadratic even in a log shifter.

Krste



| --
| You received this message because you are subscribed to the Google Groups
| "RISC-V ISA Dev" group.
| To unsubscribe from this group and stop receiving emails from it, send an email
| to isa-dev+u...@groups.riscv.org.
| To post to this group, send email to isa...@groups.riscv.org.
| Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/
| .
| To view this discussion on the web visit https://groups.google.com/a/
| groups.riscv.org/d/msgid/isa-dev/
| CAMU%2BEkyQA_Mpu6%3DtQswgu%3D-d_d77f%2BjuYAF6-CeNBKhUB039bA%40mail.gmail.com.

Bruce Hoult

unread,
Nov 2, 2017, 5:50:55 PM11/2/17
to Krste Asanovic, Michael Clark, Jacob Bachmeyer, Clifford Wolf, RISC-V ISA Dev
On Thu, Nov 2, 2017 at 11:44 PM, <kr...@berkeley.edu> wrote:

>>>>> On Thu, 2 Nov 2017 21:47:22 +0300, Bruce Hoult <br...@hoult.org> said:

| On Thu, Nov 2, 2017 at 8:37 PM, Michael Clark <michae...@mac.com> wrote:
|     A barrel shifter is by nature xlen^2 in area and xlen in depth.


| Wait, what?

| A shift/rotate unit with constant delay is normally done by a stage that can
| optionally shift/rotate by 1 position, then a stage that can shift by 2 bits,
| then a stage that can shift by 4 bits, then stages that can shift by 8, 16, 32,
| 64 bits...

| Each stage is directly controlled (pass through unchange, or shift) by one bit
| from the shift amount.

| The delay is proportional to log(xlen) and the area to xlen*log(xlen).

No - you didn't count the wires.  It's quadratic even in a log shifter.

Ok, I don't get it.

Each stage is (assuming pure rotate or shift one way only) xlen 2:1 muxes with a 2:1 fanout. The NUMBER of wires is clearly 2xlen times the number of stages, which is log(xlen).

Or .. are you saying that the shift-by-32 stage needs to be physically placed twice as far from the shift-by-16 stage as the shift-by-16 stage is from the shift-by-8 stage, purely so the wires don't have to run at a crazy acute angle?

Jacob Bachmeyer

unread,
Nov 2, 2017, 7:02:51 PM11/2/17
to Allen J. Baum, Rogier Brussee, RISC-V ISA Dev, clif...@clifford.at, michae...@mac.com, br...@hoult.org
Allen J. Baum wrote:
> At 10:02 PM -0500 11/1/17, Jacob Bachmeyer wrote:
>
>>> I'm also a little sceptical about the form of the SLLMx ops.
>>> The typical use case for masking would have rs1==0; restricing it to that would cut its opcode footprint.
>>> Can anyone come up with cases that are different? Other cases have an easy instruction sequence replacement
>>> SLL rd2,rs1,IMM; SLLM rd,imm; OR rd, rd2, rd
>>> By and large, these are tweaks, but the effort saved adds up in terms of power/area/ validation time! and future expandibility
>>> Can anyone come up with use cases beyond field isolation and muxing that could using SLLMx with a non-zero source?
>>>
>> My argument for the full orthogonal set of SLM/SLMI/SRM/SRMI is that I expect the full set to be simpler to implement than any subset -- the proposed encoding allows simply using bit29 instead of hardwired zero in the shifter. (Which is the reason I proposed that encoding.)
>>
>
> Huh - I missed the SRx variants somehow when I reformatted the list.
> I was surprised they weren't there, and was going to propose them.
>

Good; we agree there.

> I agree the ops would be useful - but only the R0 variant, which I believe will be the overwhelming majority of cases used - for the same reasons as I would not implement ORNOT above. The cost of encoding either way is inconsequential.
> I'd rename them to be MSKGENL and MSKGENR (someone else can hcoose which letters to delete to keep the mnemonic down to 4 characters)
>

Perhaps I misunderstand, but I believe that limiting mask shifts to
using x0 as source would actually be more complex to implement than
introducing full mask shifts with the proposed encoding, since the
instruction decoder would have to check that the source is x0. Further,
(IIRC) these would be the first full-size instructions in RISC-V not
fully general in register usage and GREV, rotate, and mask shifts can
all fit in the same (additional) amount of encoding space that shifts
use already.

If bits 30~27 are considered the function space for shift, the baseline
shifts use 2/16 of that space (logical and arithmetic shifts), GREV
fills a 1/16 hole (nonsensical "left arithmetic shift"), mask shifts add
1/16 (bit29 to select mask shift), and rotate fits in the combination of
bit30 and bit29 both set (nonsensical "arithmetic mask shift") filling a
1/16 gap left by mask shifts when considering the shift "function codes"
as a flag field (bit30 == "arithmetic shift", bit29 == "mask shift")
instead of fully decoding it to simplify hardware. Bits 28 and 27
remain reserved for future extension, packing all shifts, rotates and
GREV into 1/4 of the possible encoding space here. Bit 31 also remains
available for future extension, but has high fanout due to sign
extension of immediates, so is left reserved in this proposal.



-- Jacob

Allen Baum

unread,
Nov 2, 2017, 7:08:52 PM11/2/17
to Michael Clark, jcb6...@gmail.com, Clifford Wolf, RISC-V ISA Dev
Depending on how precisely you define a barrel shifter, it can take area XLEN*log2(XLEN)- that would be 6 layers of 2in Muxes - and you can build (the useful) subset of GREV into it a almost no cost.

-Allen

Jacob Bachmeyer

unread,
Nov 2, 2017, 7:27:09 PM11/2/17
to Michael Clark, Bruce Hoult, Clifford Wolf, RISC-V ISA Dev
Michael Clark wrote:
>> On 3/11/2017, at 8:14 AM, Michael Clark <michae...@mac.com> wrote:
>>
>>> On 3/11/2017, at 7:47 AM, Bruce Hoult <br...@hoult.org> wrote:
>>>
>>> On Thu, Nov 2, 2017 at 8:37 PM, Michael Clark <michae...@mac.com> wrote:
>>> A barrel shifter is by nature xlen^2 in area and xlen in depth.
>>>
>>> Wait, what?
>>>
>>> A shift/rotate unit with constant delay is normally done by a stage that can optionally shift/rotate by 1 position, then a stage that can shift by 2 bits, then a stage that can shift by 4 bits, then stages that can shift by 8, 16, 32, 64 bits...
>>>
>>> Each stage is directly controlled (pass through unchange, or shift) by one bit from the shift amount.
>>>
>>> The delay is proportional to log(xlen) and the area to xlen*log(xlen).
>>>
>> My bad. I was thinking of a cross-bar barrel shifter vs a cascade shifter.
>>
>> So the delay of GREVI+POPCOUNT would be approximately double that of a cascade shifter.
>>
>
> Considerations for small area implementations aside…
>
> If critical path is an issue i.e. performance (assuming the Bitmanip unit is pipelined), then an ISA that defines individual opcodes and widths is going to have the most performance potential, with the presumption that only the most sophisticated micro-architectures can perform macro-op fusion.
>
> Jacob is saying on one hand that critical path is an issue and on the other a design that mandates additional shift instructions is acceptable. I’m looking at real-life code and I see BSWAP32/CTZ32/CLZ32 on 32-bit integers on RV64. People that write code use int often enough. intptr_t is a relatively new type and a lot of this has only transpired over the last fifteen years on Linux since the big migration from 32-bit to 64-bit. A lot of code that runs on 64-bit is still optimised for 32-bit and does not have the additional level of complexity of using intptr_t and adjusting stride. In many cases the logical quantum is 32-bits.
>

Depending on the cost of those shift instructions and how much avoiding
them adds to the critical path length, such a design *may* be optimal.
If adding W variants reduces the maximum operating frequency, a system
may be faster overall without them.

> Handling width in the bitmanip functional unit will be an order of magnitude simpler than implementing macro-op fusion. Not encoding W variants punishes all RV64 implementations without macro-op fusion.
>

I do not expect a single "bitmanip functional unit", but operations
integrated into other functional units. For example, the mask shifts I
propose are intended to be performed by the baseline shifter. Rotates
may vary with hardware designs, but at least some topologies can use the
same functional unit for both shift and rotate. A cascade shifter with
one extra set of mux inputs can also implement GREV. Incremental area
cost is small and the effective critical path length is very slightly
increased due to the lower speed of the larger multiplexers. If the
shifter is not actually on the ALU critical path, GREV is "free".

> In fact I’m fine with macro-op fusion for the binary translation case as we can fuse them because the targets we are translating to have both 32-bit and 64-bit variants. So I don’t mind, however we should define the W instructions as macro-op fusion pairs if we go that way, so that the compiler can schedule them appropriately. Also the second instruction needs to kill the temporarily shifted result i.e. use the result register to hold the shifted temporary.
>
> The question is really whether RV64 implementors want maximal work per instruction/micro-op at the expense of 1 bit of encoding space for W variants. A sophisticated implementation with macro-op fusion is going to have a distinct micro-op either way.
>

Additional micro-ops can have far-reaching trade-offs in hardware. A
hardware CLZW opcode forces all implementations to have a CLZW micro-op,
while a "SLMI / CLZ" sequence can be fused in implementations that have
CLZW and executed literally otherwise.

> In any case, it would be unwise to not include W variants in the study set of instructions. so that we can quantity retired instruction counts. From my own experience, shaving cycles out of a critical loop result in measurable performance improvements. The optimisation metaphor is “death by a thousand cuts” for want of a better metaphor [1]. i.e. the result of many small optimisations add up.
>
> [1] https://en.wikipedia.org/wiki/Lingchi.

What? The incremental encoding cost for W variants of instructions in
OP and OP-IMM is nil: operations in those major opcodes reserve
corresponding space in OP-32 and OP-IMM-32 (and OP-64 and OP-IMM-64 in
RV128). However we encode CLZ in OP, CLZW is analogous in OP-32. There
is no expense in encoding space, since that space is already reserved
for this purpose.

The only cost for narrower variants of these instructions is hardware
complexity and these are complex trade-offs. By all means, consider W
variants at the compiler level at this time. We are still gathering
information and may find actual uses for W variants that really do work
with 32-bit data and are not merely artifacts of incomplete porting to
64-bit architectures. Even if the W variants ultimately end up as
assembler macros, we should still consider them now.



-- Jacob

Allen J. Baum

unread,
Nov 2, 2017, 8:14:48 PM11/2/17
to Bruce Hoult, Krste Asanovic, Michael Clark, Jacob Bachmeyer, Clifford Wolf, RISC-V ISA Dev
It is true that a shifter that is made up of 6 stages of 2in muxes will, at first glance, have area is proportional to Xlen^2 - except that the horizontal wires at each stage (e.g. 32 wires across the datapath in the stage that shifts by 32) are pretty tightly packed, so there's a constant term in there which is small. In addition, those wires can overlap the mux logic cells to some degree.
So the area is really (very roughly)  proportional to xlen*(xlen-M*log2(xlen)), where M is the degree of overlap
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

Erich Plondke

unread,
Nov 9, 2017, 11:25:56 AM11/9/17
to RISC-V ISA Dev, clif...@clifford.at, allen...@esperantotech.com, michae...@mac.com, jcb6...@gmail.com, br...@hoult.org

On Tuesday, October 31, 2017 at 9:39:51 AM UTC-5, Rogier Brussee wrote:


Op dinsdag 31 oktober 2017 14:22:18 UTC+1 schreef Bruce Hoult:
On Tue, Oct 31, 2017 at 3:23 PM, Clifford Wolf <clif...@clifford.at> wrote:
(2) This is the B extension. By definition, the features added by it are
aimed at bit manipulation code and will not be benefiting other code a lot.
So discussing what general non-bitmanip code needs is a strawman.

They should still be useful on most high volume general purpose computing devices. 
Things useful for, say, secure communications over networks (while minimising power consumption) would I suggest fall under the bar.

F.W.I.W.  I mostly agree but the problem is of course that everybody thinks its own applications are super important. 
I would also add that things that are common in embedded use would seem reasonable.  Last reason I can think of is that it opens
up useful opportunities for fusing instructions and the instructions themselves are cheap.

So I think it would be really useful to have a list of important applications going with the instructions and alternatives to cater for
these instructions! Here is my attempt using my memory of what I have seen pass by and happen to know. YMMV.


Opcode                                application(s)                                       Alternatives 

RLL[W] rd,rs1,rs2                 Cryptography, fusion                                
RRL[W] rd,rs1,rs2                 same                                                   ADDI rd, rs2, -XLEN; RRL[w] rd, rs1, rd                          

RLLI[W] rd,rs1,shamt            same 
RRLI[W] rd,rs1,shamt           same                                                   RLLI rd, rs1, (XLEN -shamt)
BCLZ[W] rd,rs1                     softfloat                                               

BCTZ[W] rd,rs1                     ????                                                    BREV rd rs1; BCLZ rd, rd                                             
BCNT[W] rd,rs1                     malloc, bitmaps
BREV[W] rd,rs1                     reverse addresses, bitmaps??            GREVI rd, (XLEN -1)

BSWAP[W] rd,rs1                  be <-> le e.g. networking,                   GREVI rd, (XLEN - 8) (?)

GREVI[W] rd rs1 shamt        see BSWAP, BREV bitpermutations   BSWAP, BREV  instructions for common case, trapping for others     
BGATH[W] rd,rs1,rs2            instruction decoding in traps, bitmaps  masking and shifting, instructions or csr's that directly "parse" the trapping instruction into (instr, rd, rs1, rs2, rs3, imm/shamt)??????. 
BSCAT[W] rd,rs1,rs2            bitmaps                                                 masking and shifting.
 
ANDNOT rd rs1 rs2             clear bit, fusion                                      XORI rd, rs2, -1; AND rd, rd, rs1,  (possibly extending C extension with not instruction)
ORNOT    rd rs1 rs2             masking, fusion                                    XORI rd, rs2, -1; OR rd, rd, rs1,     
SLLM        rd rs1 rs2           constructing masks, fusion                    XORI rd, rs1, -1; SLL rd, rd, rs2; XORI rd, rd, -1
SLLMI       rd rs1 shamt       same                                                     XORI rd, rs1, -1; SLLI rd, rd, shamt; XORI rd, rd, -1
 
This table is very useful.

Bit reversed addressing is very important for FFT (and friends).

CLZ is very important in signal processing & FP emulation, as pointed out.  We also use a lot of "count leading bits", which is count leading zeros or ones, whichever is there.  You can form it by counting zeros and counting the zeros of the complement and adding them / oring them, or taking the absolute value of the input before CLZ.  

The result of CLZ(#0) is sometimes desirable to have as XLEN, sometimes desirable to have as 0.  Choosing 0 (using only the log2(XLEN) bits) makes it easier to implement CLZ of lower sizes as shift left and CLZ.

Popcount is important for a variety of signal processing things also, and (while overkill) for finding parity.  

Bit reverse, Popcount, and Count Leading are the instructions that -- when they are needed -- it is extremely painful to implement without hardware support.


Reply all
Reply to author
Forward
0 new messages