Chess engines like Stockfish do a lot of bit ops.It's not much of an application, but as microbenchmarks go you might grab some of the examples from and see if they can be accelerated https://graphics.stanford.edu/~seander/bithacks.htmlOn Sat, Feb 4, 2017, at 09:12, 黃柏瑋 wrote:> Hi all,> Recently, bit manipulation task group and I are searching for target application of bit manipulation isa for benchmarking.> I’ve spend some time collecting and raise a poll in the RISC-V@Taiwan FB group. ( https://www.facebook.com/groups/riscv.tw/ <https://www.facebook.com/groups/riscv.tw/>.)
> However, I thought that we might need more input, as we need to cover a wide range of application.>> Could anyone suggest some good target application for bit manipulation ISA?> I would pass every suggestion into the group and cite your name.>> Thanks>> Po-wei Huang>>>>> --> You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.
> To post to this group, send email to isa...@groups.riscv.org.> Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.> To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/72704CFA-8515-45A9-A71A-F3BFF61A105C%40gmail.com.
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/1486238193.2103716.870385256.3B43FEEC%40webmail.messagingengine.com.
- __builtin_bswap32- __builtin_bswap64
- __builtin_popcount (32-bit)- __builtin_popcountll (64-bit)
- __builtin_clz (32-bit)- __builtin_clzll (64-bit)
- __builtin_ctz (32-bit)- __builtin_clzll (64-bit)
- __builtin_ffs (32-bit)- __builtin_ffsll (64-bit)Seehttps://en.wikipedia.org/wiki/Find_first_set
On 5 Feb 2017, at 1:51 PM, Andrew Waterman <and...@sifive.com> wrote:
On Saturday, 4 February 2017 19:53:26 PST Jacob Bachmeyer wrote:
> Andrew Waterman wrote:
> > Tommy Thorn's original `B' proposal (POPC + BREV + BSWAP) was
> > attractive because it only included instructions that are both useful
> > and expensive to synthesize from `I' instructions.
>
> [apologies in advance for the stream-of-consciousness style; the most
> important bit is after the "SLM" heading]
>
> How do EXTRACT-FIELD and WRITE-FIELD fit into this? On one hand, these
> can be synthesized from AND/OR and shifts, but on the other hand,
> synthesizing them requires forming a mask in a register, which requires
> multiple steps.
>
> EXTRACT-FIELD(source, offset, count) :=
> ((source>>offset)&((1<<(count+1))-1))
>
> WRITE-FIELD(source, offset, count, value) :=
> (((~(((1<<(count+1))-1)<<offset))&source)|((value&((1<<(count+1))-1))<<offse
> t))
EXTRACT-FIELD seems to me to be a very inefficiently encoded bit extraction instruction.
It seems much more effective to me to have EXTRACT(source, mask), which "pulls" the bits under the ones in the mask to "one side" of the output, while the bits under zeroes go to the other side. Bit ordering is maintained within the two groups.
This permits extracting multiple fields simultaneously, so long as they are already in the correct _order_.
If they are not, a second pass will pull a subset of _those_ fields to the side, and so forth.
An example, with 8-bit registers for illustration:
__________12345678____12345678_______368:12457
EXTRACT(0b10101010, 0b00100101) -> 0b100:10011
A DEPOSIT(source, mask) which is exactly the inverse seems to be a natural addition to this.
In addition, the masks are _very_ likely to be constants.
As a result, these _generalize_ EXTRACT-FIELD and WRITE-FIELD, match existing instructions on other architectures better (PDEP and PEXT, from Intel's BMI2), and don't require three logical arguments (which necessitates the encoding proposed below)
As a result, the SLM instructions aren't needed (what they're used to implement is just a trivial case of a more powerful instruction), without encoding weirdness.
> A 12-bit immediate is exactly enough to encode 6-bit offset/count for up> to RV64I, but that would not help RV128, nor would it help the dynamic> case, which is exactly where these instructions are most useful, since> the complicated expressions to form masks can be evaluated at> compile-time if the offset and count are known. However, WRITE-FIELD is> fundamentally a 5-operand instruction; while offset and count can be> combined into a control word, that only reduces it to a 4-operand> instruction (dest <- WRITE-FIELD(source, offset/count, value)).>> Setting the inability to actually encode WRITE-FIELD aside for a moment,> consider encoding EXTRACT-FIELD. It is an R-type instruction, using a> 32-bit control word in a register. To improve performance in the common> case and allow the control word to be loaded using a single instruction,> I propose permuting the control word bits and assigning bits 0:5 as> count[0:5] and bits 6:11 as offset[0:5], with higher bits either ignored> or must-be-zero on RV32/RV64 and bits 12:21 as count[6:15] and bits> 22:31 as offset[6:15]. This assumes that most bitfields will be less> than 32 bits long, but may appear anywhere in a 64-bit word. In the> static case where these assumptions are met, the control word can be> loaded with a single ADDI and sign extension will fill the top bits with> zero. In the dynamic case on RV32/RV64, the control word can be formed> with a SHIFT/OR sequence, which could be a candidate for macro-op> fusion. The process is a bit more complex for RV128, but EXTRACT-FIELD> can itself be used to split the count and offset inputs, with SHIFT/OR> to build the control word.
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/5166149.CxW9gsFpkr%40arkadios.
(2) The additional hardware burden to support SLM/SLMI (and a corresponding SRM/SRMI "for free") is tiny--as little as a single wire from instruction decode to ALU with no additional gates. The additional hardware required to implement EXTRACT and DEPOSIT is significant.
That said, the new shifts are still useful even with your EXTRACT and DEPOSIT, since they aid in quickly constructing masks in the dynamic case, such as LZMA decoding, where the sizes of various bit fields are parameters.
-- Jacob
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/58969594.8010001%40gmail.com.
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/589697F5.2010301%40gmail.com.
I hope everyone is at least familiar with PowerPC rlwimi (Rotate Left Word Immediate and Mask Insert) and Aarch64 bfm (BitField Move).rlwimi has 15 bits of immediate (3x 5 bits for field start, end, and rotate amount) and only addresses 32 bit registers so doesn't fit well in RV.The ARM64 bfm instruction is I think especially nice. 5 bit src and dst register fields and a 12 bit immediate. Plus two bits in the opcode to indicate whether to take the other dst bits from the existing dst contents or from zero register, and whether to sign or zero extend if inserting into zeroes (so only three of the four combinations are permitted).The sign extend feature could be dropped, as that's easy enough to do other ways if needed.Depending on the immediate bits, bfm either moves an arbitrary field to right-aligned, OR a right-aligned value to an arbitrary positiion. The extract operation is also not necessary to emulate, as you can extract an arbitrary field with a left shift followed by a right logical or arithmetic shift (depending on whether you want to zero or sign extend).So the 12 bit immediate could be a simple shift amount & field size pair. Or, better, rotate amount, eliminating the need for a separate rotate instruction.Leaving us with:BFI - Bit Field Insert, andBFIZ - Bit Field Insert Into ZeroHmm .. I just talked myself out of BFIZ, as that can also be done by two shifts.BFI dst, src, #rotate, #size, I format rot[5..0] | size[5..0] | rs1 | funct3 | rd/rs2 | opcodeThe only bad thing about it is there aren't any other integer instructions that need to duplicate the dst field to rs2
At the implementation level, are start&end bit positions better than size?--On Sun, Feb 5, 2017 at 6:11 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:Michael Clark wrote:
- OpenSSL - ciphers and digest algorithms frequently use the rotate primitives
Rotate lacks a compiler intrinsic but can be detected by compilers - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62263
Rotate can also be implemented with SHIFT/SHIFT/OR using one scratch register. Add a fourth opcode that clobbers the scratch register and you have a plausible candidate for macro-op fusion.
ROTLI:
SRLI t0, source, XLEN-count
SLLI dest, source, count
OR dest, dest, t0
??? t0, ...
The trailing ??? indicates that the following instruction must clobber t0 without referencing t0, permitting the SRLI/SLLI/OR sequence to be executed with a single regfile write to dest. This formulation allows a rotate operation to immediately follow another.
-- Jacob
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/589697F5.2010301%40gmail.com.
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAMU%2BEkyFsJSsTUQfzNiho%2B4tN7cGy0OYQfHbtV_EB4zzutq51w%40mail.gmail.com.
It's not much of an application, but as microbenchmarks go you might grab some of the examples from and see if they can be accelerated https://graphics.stanford.edu/~seander/bithacks.html
On Sun, Feb 5, 2017 at 2:43 AM Bruce Hoult <br...@hoult.org> wrote:I hope everyone is at least familiar with PowerPC rlwimi (Rotate Left Word Immediate and Mask Insert) and Aarch64 bfm (BitField Move).rlwimi has 15 bits of immediate (3x 5 bits for field start, end, and rotate amount) and only addresses 32 bit registers so doesn't fit well in RV.The ARM64 bfm instruction is I think especially nice. 5 bit src and dst register fields and a 12 bit immediate. Plus two bits in the opcode to indicate whether to take the other dst bits from the existing dst contents or from zero register, and whether to sign or zero extend if inserting into zeroes (so only three of the four combinations are permitted).The sign extend feature could be dropped, as that's easy enough to do other ways if needed.Depending on the immediate bits, bfm either moves an arbitrary field to right-aligned, OR a right-aligned value to an arbitrary positiion. The extract operation is also not necessary to emulate, as you can extract an arbitrary field with a left shift followed by a right logical or arithmetic shift (depending on whether you want to zero or sign extend).So the 12 bit immediate could be a simple shift amount & field size pair. Or, better, rotate amount, eliminating the need for a separate rotate instruction.Leaving us with:BFI - Bit Field Insert, andBFIZ - Bit Field Insert Into ZeroHmm .. I just talked myself out of BFIZ, as that can also be done by two shifts.BFI dst, src, #rotate, #size, I format rot[5..0] | size[5..0] | rs1 | funct3 | rd/rs2 | opcodeThe only bad thing about it is there aren't any other integer instructions that need to duplicate the dst field to rs2rd-as-implicit-source-operand is the main reason conditional-move isn't in the base ISA. I advise that B instructions adhere to the base ISA template, else it will restrict uptake of the B extension
If you are a member of the task group you should have already received this notification. If not, please join the Foundation and the bit manipulation task group. If you are in the process of joining and that process is just not complete, let me know and I will try to make sure you have what you need for tomorrow. I am unable to guarantee it, but I will try..
Rex McCrary
AMD
Bit Manipulation Task Group Chair
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAMU%2BEkyzAFns6xw4hv6XfLXczVEfqBoyrXAB1StOdVeAssPv5g%40mail.gmail.com.
Bruce Hoult wrote:
So that leaves such possibilities as helper instructions to:
1) zero out a field in the destination, if needed (it often won't be). This is basically a BIC/ANDNOT of a mask, or AND of the inverse of a mask. (I note in passing that DEC machines such as the PDP11 and VAX had a BIC instruction but didn't have AND).
For RV64I I can't see any way to do this in fewer than five instructions (unless you load a mask from a literal pool, which I assume is undesirable). There are quite a few ways to do it in five, but I think no ways to do it in four without either a rotate or the proposed "shift in 1s" instruction, or BIC.
"Shift in 1s" and a rotate together would permit a three instruction sequence:
slmi t0, zero, #xlen-fieldSize
rli t0, t0, #xlen-fieldStart m # or rri t0, t0, #fieldStart
and dst, dst, t0
rli would fit in nicely with slli, perhaps using the same bit 30 that distinguishes srli/srai and add/sub.
But then the same would apply to encoding slmi :( They can't both have it.
I see a solution here. I proposed using bit 29 to select "shift-in-ones", but bit 30 and bit 29 both set would suggest "arithmetic shift mask" which is nonsensical (and undefined for a left shift). At a slight increase in decode complexity, the combination of bit 30 and bit 29 could represent "rotate". The minor opcode in funct3 already determines left/right. The major opcode already determines whether to use an immediate or an rs2. So adding "shift-in-ones" really adds SLM/SRM/SLMI/SRMI.
My previous statement that there is a large amount of space in funct7 on shifts was incorrect--RV64I adds bit 25 to the shift amount and RV128I can be expected to add bit 26 to the shift amount. That leaves bits 27, 28 and 31 for further "shift-like" operations, but bit 31 has a high fanout. Using bits 30:29==2'b01 to select "shift-in-ones" and bits 30:29==2'b11 to select "rotate" would make efficient use of this encoding space, leaving three bits for future extensions.
A single instruction "Clear field" would be much better.
That runs into the same encoding difficulties as EXTRACT-FIELD, requiring a combined offset/length control word in a register.
I note that Jacob's MASK-FIELD C code and asm code differ -- the C did a ~ on the mask, but the asm didn't.
That was an error in my asm code. There should have been a "NOT t0, t0" just before the AND in the MASK-FIELD example. MASK-FIELD was really a helper operation for WRITE-FIELD and I did not give the same attention to proofreading it as the other two. My mistake.
I was interested to see AMD added ANDN (which I usually call BIC, as I started on PDP11/VAX). I've always liked this instruction, and it would eliminate the NOT you forget in your assembly code :-) I'd like to see ORN as well.
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAMU%2BEkzea25Zz0KSY%2B6fpKAU8Nr%2B5%3D5AnL9fkRvq2BaUc8cFFw%40mail.gmail.com.
The "shift-in-ones" instruction I propose replaces the load-1/shift-left/subtract-1 sequence and also allows two more single-instruction expansions for LI: If the immediate is a contiguous group of ones on either side, use SLMI/SRMI from x0.
I don't really agree with this about the static case.
Sure, masks of 11 bits or fewer can come directly from a LI and moved anywhere in a 32/64/128 bit register with a shift. masks up to 32 bits can be made with a LUI/ADDI (plus a shift for 64/128).
So you're up to 2 or 3 instructions already and what you want most is the negative mask for clearing the field in the destination if you don't have ANDN, so that needs either another NOT or your "shift in 1s".
The instruction count is rising rapidly, compared to a "clear-field" single instruction.
The compiler can also choose to load the mask from a constant pool--one LOAD instruction for any mask. This is not an option in the dynamic case.
Perhaps there needs to be some general guideline of how much instruction sequences need to be shortened by in order to justify a new instruction?
I think of it as "benefits (shortened programs) versus costs (implementation complexity)". The "shift-in-1s" instructions have almost trivial incremental implementation complexity and the proposed encoding for "shift-in-1s" also leads (by creating an invalid case) to an encoding for rotate operations. I proposed it as a minimal-cost alternative to complex field-access instructions that is generic enough to have other uses. I expect that the hardware cost to implement "shift-in-1s" is less than the cost to implement "rotate".
POPC, CLZ, PDEP, PEXT all replace very long sequences (or even loops). It seems clear that if those operations are sufficiently common to care about at all then either a single instruction or helper instructions are justified. (I'm not convinced the added generality of PDEP & PEXT vs field-extract and field-insert would be used)
POPC and CLZ are also relatively simple; apparently the original "B" proposal derived CLZ from BREV+CTZ and CTZ from POPC. I am not quite sure how and my efforts to find this proposal have come up empty.
PDEP and PEXT are anything but simple in an efficient implementation and I think that they are probably best omitted due to their enormous complexity.
ANDN and ORN are quite trivial and don't shorten programs very much by themselves, but they do open up a lot more possibilities for subsequent 2-instruction fusing.
While these can eliminate scratch registers from some sequences and the encoding atop the existing AND and OR could be very simple (bit 30==invert rs2 if set; bit 29==invert rs1 if set; invert both inputs if both bits are set), this effectively adds an XOR gate delay to the ALU path for bitwise operations.
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAMU%2BEkza-2mwoy7p2qxCFpmtho1-uaE8xVkUvw0295yPXrB%3DNw%40mail.gmail.com.
If you can tolerate 3 register operands (which an implementation that has fused MulAdd might require)
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAF4tt%3DA_1C1%2Bjv5duuGpBZU%2BE5m9yutYc3caMU1CvgK8o%3D8pYQ%40mail.gmail.com.