--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/D8795FF3-462F-4E17-9B36-57F6CDA38919%40mac.com.
On 13 Mar 2016, at 6:32 AM, Samuel Falvo II <sam....@gmail.com> wrote:
> BGS (Bit Gather Scatter) uses 4-bit packed offsets in a gather scatter mask register (rs2) to gather bits from a specified half or quarter of a source register (rs1) and write bits to a half or quarter of a destination register (rd). Gather bit offsets in the source register are based on the right-to-left position of the 4-bit packed offset relative to the source half or quarter. Scatter bit offsets in the destination register are based on the packed bit offset value relative to the destination half or quarter.
I've read through this many times, and I'm still thoroughly confused
by your scatter/gather notes. Can you discuss the specific problem
you're trying to solve, and how your proposed solution aims to help?
Can you include example code to help illustrate?
On 13 Mar 2016, at 9:07 AM, Mike Hamburg <mi...@shiftleft.org> wrote:
Is rotate insufficiently orthogonal? It’s used a lot in crypto, but maybe not in anything else?
I’m also confused about SRLM. Is this to avoid a shift right followed by and/andi?
One of the wonderful qualities of the RISC-V ISA is the lack of cruft; only
instructions that have quantitative merits are included. As the B-set isn’t
a mandated feature, to maximize the chance of getting it adopted by
implementations, it must have a compelling benefit/cost ratio.
The Princeton paper talks about a method using butterfly networks for handling arbitrary bit permutations, but it seems to be very complex.
We’d need the V extension with a 64 element control vector of bytes containing 6-bit offsets to perform an arbitrary bit permutation on a 64-bit word.
Of course there are more compact encodings, since once a bit has been selected there are fewer remaining bits that can be described if each bit appears in the destination word only once. Clifford Wolf has done the math for this.
A 512-bit (64 x 8-bit, using only 6 bits of each) vector is quite expensive; however, an arbitrary bit gather with multiple bit crossings can be specified in one instruction, as per my attempt at a BGS description below (which was working on 16-bit values with 4-bit offsets).
At least we have prior art for VPSHUFBITQMB on the RISC-V mailing list (Mar 2016), which pre-dates Intel’s public document (Oct 2017).
ok so in a separate thread, related to 3D Graphics and Video processing, i mentioned an idea for a Simple-Vector extension which boils down to a single instruction:
SLO (R-type)
{opcode, funct3, funct7} = {$OP, $SLL, 7'b0010000}
SRO (R-type)
{opcode, funct3, funct7} = {$OP, $SRL, 7'b0010000}
SLOI (I-type)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {7'b0010000, shamt}} (RV32)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {6'b001000, shamt}} (RV64)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {5'b00100, shamt}} (RV128)
SROI (I-type)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {7'b0010000, shamt}} (RV32)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {6'b001000, shamt}} (RV64)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {5'b00100, shamt}} (RV128)
ROL (R-type) (rotates fit as "arithmetic shift ones")
{opcode, funct3, funct7} = {$OP, $SLL, 7'b0110000}
ROR (R-type)
{opcode, funct3, funct7} = {$OP, $SRL, 7'b0110000}
ROLI (I-type) (hardware opcode reserved for symmetry)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {7'b0110000, shamt}} (RV32)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {6'b011000, shamt}} (RV64)
{opcode, funct3, imm12} = {$OP-IMM, $SLL, {5'b01100, shamt}} (RV128)
RORI (I-type)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {7'b0110000, shamt}} (RV32)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {6'b011000, shamt}} (RV64)
{opcode, funct3, imm12} = {$OP-IMM, $SRL, {5'b01100, shamt}} (RV128)
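As a rough cross-check of the R-type rows above, the fields can be packed into a 32-bit instruction word. This sketch assumes the standard base-ISA values for $OP (0x33) and for the $SLL and $SRL funct3 codes (1 and 5). The helper name encode_rtype and the enum names are made up for illustration, and the funct7 values are the proposed ones from this message, not ratified encodings.

```c
#include <assert.h>
#include <stdint.h>

/* Base-ISA constants (standard values). */
enum { OPC_OP = 0x33, F3_SLL = 0x1, F3_SRL = 0x5 };

/* Proposed funct7 values from the listing above (hypothetical, not ratified). */
enum { F7_SHIFT_ONES = 0x10 /* 7'b0010000 */, F7_ROTATE = 0x30 /* 7'b0110000 */ };

/* Pack the six R-type fields into one instruction word:
   funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0]. */
uint32_t encode_rtype(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                      uint32_t funct3, uint32_t rd, uint32_t opcode)
{
    assert(funct7 < 128 && rs2 < 32 && rs1 < 32);
    assert(funct3 < 8 && rd < 32 && opcode < 128);
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}
```

For example, encode_rtype(F7_SHIFT_ONES, 12, 11, F3_SLL, 10, OPC_OP) would be the word for "slo x10, x11, x12" under these assumptions.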
Slightly-less confident, but opcode and funct3 near-certain:
ANDC (R-type)
{opcode, funct3, funct7} = {$OP, $AND, 7'b0100000}
Placed here by considering funct3 $SLT and $SLTU as "special function"
codes:
PCNT (R-type)
{opcode, funct3, funct7, rs2} = {$OP, $SLT, 7'b0100000, 5'b00000}
BEXT (R-type)
{opcode, funct3, funct7} = {$OP, $SLT, 7'b0010000}
BDEP (R-type)
{opcode, funct3, funct7} = {$OP, $SLT, 7'b0110000}
CLZ (R-type)
{opcode, funct3, funct7, rs2} = {$OP, $SLT, 7'b0001000, 5'b00000}
CTZ (R-type) (hardware opcode tentatively reserved)
{opcode, funct3, funct7, rs2} = {$OP, $SLT, 7'b0101000, 5'b00000}
The funct7 codes in this group are bit-reflected counting in binary,
starting at bit 30.
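One way to read "bit-reflected counting in binary, starting at bit 30" is sketched below: a counter 1, 2, 3, ... is written with its bits reversed, with counter bit 0 landing on instruction bit 30 (funct7 bit 5). The function name reflected_funct7 is hypothetical, and this is only my interpretation of the sentence, checked against the five funct7 codes listed in this group.

```c
#include <assert.h>
#include <stdint.h>

/* Map a counter n to a funct7 value by reflecting its bits:
   counter bit 0 -> funct7[5] (instruction bit 30),
   counter bit 1 -> funct7[4] (instruction bit 29), and so on.
   funct7[6] (instruction bit 31) is left clear. */
uint8_t reflected_funct7(unsigned n)
{
    assert(n < 64);
    unsigned f = 0;
    for (int i = 0; i < 6; i++) {
        if ((n >> i) & 1u)
            f |= 1u << (5 - i);
    }
    return (uint8_t)f;
}
```

Under this reading, counters 1 through 5 give 7'b0100000, 7'b0010000, 7'b0110000, 7'b0001000 and 7'b0101000, which are the codes shown for PCNT, BEXT, BDEP, CLZ and CTZ.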
Additionally, ANDCI, PCNTI, CLZI, CTZI, are all possible as assembler
pseudo-instructions. ANDCI simply assembles to ANDI with the immediate
complemented. PCNTI, CLZI, and CTZI can all be evaluated in the
assembler and assembled as LI with the result.
BEXT and BDEP require XLEN-bit control words and therefore cannot have
an immediate form.
The *W variants on RV64 are encoded almost identically, but with the
opcode field changed to $OP-32 or $OP-IMM-32 as appropriate. The same
pattern provides the *D variants for RV128.
Note that *W and *D variants of some of these instructions could be
assembler pseudo-instructions and macro-op fused at runtime.
> Also: I think that (1) concrete instruction encodings are a red herring at
> this stage and (2) there is a separate Opcode Space Management Task Group
> within the RISC-V foundation and they would have the last word on concrete
> encodings anyways.
>
I firmly believe that any proposed instruction should include a proposed
concrete encoding, if for no other reason than to prove that the
instruction *can* be encoded in RISC-V's constraints.
Hi,
On Fri, Apr 06, 2018 at 11:03:57PM -0500, Jacob Bachmeyer wrote:
>> The problem isn't lack of volunteers. I would have volunteered to keep the
>> task group alive (at least until a new proper chair would have been found),
>> but I do not have the $10,000 per year to buy a gold membership.
>
> Could the discussions continue here on the isa-dev list?
Sure. But while the spec text itself might not be perfect, the set of
instructions is pretty much at a point where the path forward is to
actually implement them in at least one compiler and some cores and
continue with doing proper quantitative analysis using actual hardware
designs to estimate the hw cost and actual programs to measure the gains.
But it is *much* easier to convince people to contribute to that when you
have a mandate by being an official task group. So I'm afraid it will be
hard to actually convince compiler and hw devs to participate.
I'd also expect "continued discussions" here on the isa-dev list would only
be repetitions of the discussions already had and to be honest I don't know
how much time and energy I have for repeating past discussions and
reiterating past arguments without clear outlook of this going anywhere.
bcnt rd rs1 rs2      pop_count(rs1 &~ rs2)    #bcnt is more general than pcnt
bcnt (R-type)
{opcode, funct3, funct7} = {$OP, $AND, 7'b0101'000}
pseudo's
pcnt rd rs1          bcnt rd rs1 zero
zcnt rd rs1          li rd -1; bcnt rd rd rs1
ctz rd rs1           sub rd zero rs1; bcnt rd rd rs1    #note this does ctz: 0 --> 0
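A small C model of the proposed bcnt primitive, with the pcnt and zcnt pseudo-instructions written on top of it, may help when checking the expansions above. The function names are hypothetical and a 64-bit register width is assumed. The trailing-zero helper uses bcnt(x - 1, x) rather than the sub-based sequence quoted above: modelled literally, that sequence gives 63 for an input of 1 on 64 bits, so the substitution here is mine and not the author's, and it returns 64 (not 0) for a zero input.

```c
#include <assert.h>
#include <stdint.h>

/* bcnt rd, rs1, rs2: population count of (rs1 & ~rs2). */
unsigned bcnt64(uint64_t rs1, uint64_t rs2)
{
    uint64_t v = rs1 & ~rs2;
    unsigned count = 0;
    while (v != 0) {
        v &= v - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* pcnt rd, rs1  ->  bcnt rd, rs1, zero */
unsigned pcnt64(uint64_t x) { return bcnt64(x, 0); }

/* zcnt rd, rs1  ->  li rd, -1; bcnt rd, rd, rs1 (counts zero bits) */
unsigned zcnt64(uint64_t x) { return bcnt64(~(uint64_t)0, x); }

/* Trailing-zero count via the same primitive (my formulation):
   (x - 1) & ~x has ones exactly below the lowest set bit of x. */
unsigned ctz64_alt(uint64_t x) { return bcnt64(x - 1, x); }

/* Usage examples. */
void bcnt_examples(void)
{
    assert(pcnt64(0xF0) == 4);
    assert(zcnt64(0xF0) == 60);
    assert(ctz64_alt(8) == 3);
}
```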
-- Jacob
Rogier Brussee wrote:
> Op dinsdag 10 april 2018 06:28:56 UTC+2 schreef Jacob Bachmeyer:
>
> Rogier Brussee wrote:
> >
> That has some merit, but I like to think of B as just some additional
> more specialised I instructions and I think it should just get as
> little in the way as possible for other extensions that want to define
> variants of existing instructions. Having B instructions close to
> "and", "or" and "xor" as well as sll and srl feels right and is what
> I proposed for hamm/bcnt and clz. That feeling may well be just that
> though: an irrational feeling.
If the CLZ opcode includes an ANDC (clz(rs1 & ~rs2)), encoding it near
ANDC might be an interesting option, especially if the same bit can
select Hamming distance ("HADI"?) when used near XOR.
> [...]
> > BEXT and BDEP require XLEN-bit control words and therefore cannot have
> > an immediate form.
>
>
> One could define BEXTI and BDEPI like BEXT and BDEP but with rs2
> replaced with a _zero_ extended 12 bit immediate to grab and insert in
> the low 12 bits (or effectively the low 11 bits if you insist on sign
> extension of immediates).
Possibly, but there is no particularly good place to encode immediate
forms of these instructions; the only space left in OP-IMM is near the
shift opcodes. (The encodings of GREV/ROL/ROR near shift suggest
similar immediate-form opcodes.) Are B{EXT,DEP}I useful enough to
justify allocating such limited space rather than using a 2-instruction
ADDI/{BEXT,BDEP} pair? If so, B{EXT,DEP} could also be encoded near shift.
I was briefly concerned while reviewing the instruction tables that
RV64I shifts could conflict with RVM. This was in error, since RVM
opcodes exist only in OP, and the expanding "shamt" field exists only in
OP-IMM. This does, however, mean that there is a bit of additional
encoding space near shift for instructions that do not have immediate
forms. Two bits are thus available here (when RV128 is considered) that
cannot be used in OP-IMM near shifts, and are thus available for funct7
codes in OP. The base ISA uses 1/4 (funct7 = 7'b0000000) of this space
and RVM uses another 1/4 (funct7 = 7'b0000001) of this space. The two
remaining funct7 codes are 7'b0000010 and 7'b0000011. This corner of
the encoding is available near all operations in OP, but *not* near the
shifts in OP-IMM. No space is available in OP-IMM near other operations.
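The overlap described above can be sketched as a small predicate. It assumes shamt widths of 5, 6 and 7 bits for RV32, RV64 and RV128 (consistent with the listings earlier in the thread), so the shamt field grows into the low one or two bits of what would be funct7 in OP. The function name overlaps_shamt is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when an OP funct7 code uses low bits that the matching OP-IMM shift
   needs for its shamt field, i.e. the code has no immediate-form counterpart
   near the shifts at this XLEN. */
bool overlaps_shamt(uint8_t funct7, unsigned xlen)
{
    assert(xlen == 32 || xlen == 64 || xlen == 128);
    unsigned shamt_bits = (xlen == 128) ? 7u : (xlen == 64) ? 6u : 5u;
    unsigned grown = shamt_bits - 5u;   /* funct7 bits taken over by shamt */
    return (funct7 & ((1u << grown) - 1u)) != 0;
}
```

With RV128 in mind, the codes 7'b0000001 (RVM), 7'b0000010 and 7'b0000011 all overlap, while codes such as 7'b0010000 do not.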
-- Jacob
There is also the issue of encoding regularity, which is reduced by
omitting instructions like this.
> You would start with an immediate. Then most masks I can think of (so
> it may be me) would be constructed just as well with SLLI, SRAI (which
> compress), RORI, ANDC (so a negative mask works just as well as a
> positive) and last but not least, a SROI pseudo that is just
>
> SROI rd rs1 imm --> XORI rd, rs1, -1; SRLI rd rd imm; XORI rd rd -1
>
> (where if it follows a LI the first XORI can be absorbed in the
> immediate of the LI, and the last XORI can be absorbed in an ANDC if
> the next instruction is an AND instruction. Hell, the OP opcode space
> for R-type instructions has comparatively lots of room, so an ORC
> instruction encoded like ANDC but with funct3 = OR to absorb NOTs
> may have better value for money)
That would be an argument against the shift-ones instructions, since the
same works for SLOI/SLLI.
I believe that the shift-ones instructions,
particularly with an encoding using bit 29 to select shift-ones, are
simple enough to be worthwhile, and that the entire set of
SLO/SRO/SLOI/SROI together is simpler to implement than any subset.
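The equivalence under discussion (a shift-ones operation versus an XORI / shift / XORI sequence) can be checked with a short C model. This is a sketch for 32-bit values only; the function names are hypothetical, and the "direct" versions reflect my reading of shift-ones as "shift, filling the vacated bits with ones".

```c
#include <assert.h>
#include <stdint.h>

/* Shift right, filling vacated high bits with ones. */
uint32_t sro32(uint32_t x, unsigned n)
{
    assert(n < 32);
    return (x >> n) | (uint32_t)~(UINT32_MAX >> n);
}

/* Shift left, filling vacated low bits with ones. */
uint32_t slo32(uint32_t x, unsigned n)
{
    assert(n < 32);
    return (x << n) | ((UINT32_C(1) << n) - 1u);
}

/* The three-instruction sequence from the quoted text. */
uint32_t sro32_via_xor(uint32_t x, unsigned n)
{
    assert(n < 32);
    uint32_t t = x ^ UINT32_MAX;   /* XORI rd, rs1, -1 */
    t >>= n;                       /* SRLI rd, rd, imm */
    return t ^ UINT32_MAX;         /* XORI rd, rd, -1  */
}

/* The same trick with SLLI in the middle. */
uint32_t slo32_via_xor(uint32_t x, unsigned n)
{
    assert(n < 32);
    uint32_t t = x ^ UINT32_MAX;
    t <<= n;                       /* SLLI rd, rd, imm */
    return t ^ UINT32_MAX;
}
```

Both pairs agree for every shift amount from 0 to 31, which is the sense in which the pseudo-instruction argument applies to SLOI as much as to SROI.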
-- Jacob
PS: The current XBitmanip draft spec contains an additional chapter
titled "Alternatives to bext and bdep" that proposes the two additional
unary instructions ZIP and UNZIP:
https://raw.githubusercontent.com/cliffordwolf/xbitmanip/master/xbitmanip-draft.pdf
Hi Clifford,
I had started a separate note last week, thanking you and the other folks for picking up work on the BitManip spec, but got side-tracked before finishing it. Thanks for doing so!
One of the things I noticed while reviewing the new draft - clz no longer includes the result + immediate adder delay at the end, and that's a good thing to have eliminated.
Adding 2B instr. encodings for common cases is also wise.
Adding 32-bit operand size instruction variants is also positive. Who's to say what a given implementation's max logic depth will be, or whether some of these new ops like popcnt might be two-cycle 64-bit ops but one-cycle 32-bit ops? Best not to assume that everyone's going to be fine running 64-bit ALU ops for 32-bit data. (And 32-bit operations save power vs 64-bit ops.)
Regarding the below comment on macro-op fusion - I don't think we should assume that all RISC-V implementations that care about performance have designed instruction fusion logic into their instr. decoders. Isn't it better to make sure that the ops (that we care about the performance of) map to a single instruction where possible, so that they don't take a perf hit on fusion-less machines?
(I realize there's a sentiment from the original RISC-V architects to the effect that all performant RISC-V implementations need to implement instruction fusion. But I strongly disagree. If other competing ISAs don't require fusion logic for similar operations (or the extra I-stream bandwidth requirements it encourages), then shouldn't RISC-V try to be competitive here as well?)
Thanks!
Dan Hopper
Hi Clifford,
Ah, yes, I do agree in this bsrm case. Thanks for the detailed explanation. I also agree that the 2 * 2B instruction fusion case is not difficult. Many of the pre-existing fusion cases, however, don't fit into this convenient scenario, thus my general concern. (Apparently misplaced for the bsrm case, however.)
I'm not 100% on the same page wrt the x86 fusion examples, as those are just ordinary variable-length x86 instructions, handled by any compliant x86 decoder, where valid instruction lengths are 1-15 bytes. True, it's not reasonable to expect the same functionality to be packed into a shorter instruction length, your main point.
(In a separate discussion, I might argue for supporting additional RISC-V instruction lengths of 6B and 8B as being preferable to implementing instruction fusion. The former is mostly limited to the decoder logic, the latter also ripples down to retire/exceptions, since you have to be able to split up the fused op's retires in certain exceptional cases.)
Thanks,
Dan
Hi,
On Mon, Apr 16, 2018 at 8:31 PM, Dan Hopper <dho...@tesla.com> wrote:
> I had started a separate note last week, thanking you and the other folks for picking up work on the BitManip spec, but got side-tracked before finishing it. Thanks for doing so!
Thanks.
> Regarding the below comment on macro-op fusion - I don't think we should assume that all RISC-V implementations that care about performance have designed instruction fusion logic into their instr. decoders. Isn't it better to make sure that the ops (that we care about the performance of) map to a single instruction where possible, so that they don't take a perf hit on fusion-less machines?
> (I realize there's a sentiment from the original RISC-V architects to the effect that all performant RISC-V implementations need to implement instruction fusion. But I strongly disagree. If other competing ISAs don't require fusion logic for similar operations (or the extra I-stream bandwidth requirements it encourages), then shouldn't RISC-V try to be competitive here as well?)
The operation in question was the following:
On Saturday, March 12, 2016 at 8:28:38 AM UTC+1, Michael Clark wrote:
> * bsrm - Shift Right and Mask (w,d,q) C expression: (src >> start) & ((1 << masklen)-1)
And the macro-op fusable sequence that I proposed for it was this:
c.slli a0, (32-START-LEN)
c.srli a0, (32-LEN)
If we want to support something like "bsrm" all the way to RV128 then the instruction would have a 2x7 = 14 bit immediate.
That instruction would be half of a major opcode (!), or over 32 times the size of the entire xbitmanip extension as currently proposed.
Also, some implementations might not actually want to spend the extra area for a second barrel shifter for this operation. With macro-op-fusion it is up to the individual implementation if they want to support this operation in a single micro-op or in two micro-ops.
While I agree that generally speaking macro-op-fusion is an advanced technique that only a small number of cores will support, I'd argue that macro-op-fusion of two compressed 16-bit opcodes into one 32-bit opcode is a special case that is relatively easy to support.
Re macro-op-fusion in competing architectures: x86 *does* perform macro-op-fusion, just not for this function.
The x86 BMI1 "bextr" instruction (the register variant) is a 5 byte opcode. (It would be possible to implement a similar functionality as a command within the "shuffle" instruction, see current xbitmanip-draft.pdf. But to be honest, I don't think that would be worth it.)
The x86 TBM "bextr" instruction (that is the one with immediates) is a 7 byte instruction. Trying to squeeze the same functionality into a 4 byte RISC-V instruction, just because another already existing 4 byte encoding might be a bit harder to decode, seems like a big waste of RISC-V instruction encoding space to me.
I hope you can agree that these are all good arguments against a dedicated "bsrm" instruction.
Regards,
- clifford
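The claim that the two-shift sequence computes the bsrm expression can be checked with a small model. This sketch assumes a 32-bit register, a field length between 1 and 31, and start + len no greater than 32; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The bsrm C expression from the original proposal. */
uint32_t bsrm32(uint32_t src, unsigned start, unsigned len)
{
    assert(len >= 1 && len <= 31 && start + len <= 32);
    return (src >> start) & ((UINT32_C(1) << len) - 1u);
}

/* The fusable two-shift sequence on a 32-bit register. */
uint32_t bsrm32_two_shifts(uint32_t src, unsigned start, unsigned len)
{
    assert(len >= 1 && len <= 31 && start + len <= 32);
    uint32_t t = src << (32 - start - len);   /* c.slli a0, (32-START-LEN) */
    return t >> (32 - len);                   /* c.srli a0, (32-LEN)       */
}
```

The left shift discards everything above the field and the logical right shift discards everything below it, so no separate mask is needed.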
zip + butterfly(mask, 0)
I really need to look at how butterfly is actually implemented in HW. This is basically a GREV with a specific power-of-2 swap distance, but with individual bit enables.
A question: does
zip + butterfly(mask, 0)
mean zip followed by butterfly, { butterfly(zip()) },
or butterfly followed by zip, { zip(butterfly()) },
or something different?
As I had mentioned in the bit manipulation group meetings, you can implement butterfly with an all-1s mask using existing shift logic.
Once you add the mask, though, all bets are off; it turns out to be nasty, as you need to control each individual swap (as opposed to a shifter, which has one control for each log2 shift). The amount of logic is trivial; the amount of wiring is horrible.
I'll think about how to reduce that to something more reasonable (shifters already have a lot of horizontal wires because you need to send LSBs to the MSBs and vice versa).
Separately: it was mentioned at Coolchips that there is no big-endian support in RISC-V.
The butterfly/grev ops can alleviate that, but it might be easier if the mask is inverted (so 0 means swap, and 1 means don't).
The control constants in the normal cases can then more easily be generated, as all the upper bits (including the mask) are zeroes.
This is less than a perfect solution; a better solution is another B-extension (Big_endian).
I totally agree that email is a bad medium to discuss this; I was going to say the same thing. I would love to see your sketches.
I was hoping to almost completely reuse the shifter for grev for all the most useful cases (which I thought were the k=2^n cases), but the really useful ones are the k=-2^n cases.
The reason I thought the mask should be inverted is that, for the common case (all swaps enabled, e.g. for big-endian support), it would be much more difficult to generate a constant with many 1s, whereas one with many 0s is easy.
Finally, it isn’t clear to me how to implement a load signed big-endian halfword, even with grev.
lhu a0, offset(a0)
grevi a0, a0, 24
srai a0, a0, 16
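The three-instruction sequence can be traced with a software model of a 32-bit generalized reverse. This is a sketch assuming RV32, a little-endian load for lhu, and the usual staged definition of grev (swap adjacent groups of 1, 2, 4, 8 and 16 bits when the matching bit of k is set); the names grev32 and load_be_s16 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generalized reverse on 32 bits: each set bit of k enables one swap stage. */
uint32_t grev32(uint32_t x, unsigned k)
{
    if (k & 1)  x = ((x & 0x55555555u) << 1)  | ((x & 0xAAAAAAAAu) >> 1);
    if (k & 2)  x = ((x & 0x33333333u) << 2)  | ((x & 0xCCCCCCCCu) >> 2);
    if (k & 4)  x = ((x & 0x0F0F0F0Fu) << 4)  | ((x & 0xF0F0F0F0u) >> 4);
    if (k & 8)  x = ((x & 0x00FF00FFu) << 8)  | ((x & 0xFF00FF00u) >> 8);
    if (k & 16) x = ((x & 0x0000FFFFu) << 16) | ((x & 0xFFFF0000u) >> 16);
    return x;
}

/* Model of: lhu a0, offset(a0); grevi a0, a0, 24; srai a0, a0, 16 */
int32_t load_be_s16(const uint8_t *p)
{
    assert(p != NULL);
    uint32_t v = (uint32_t)p[0] | ((uint32_t)p[1] << 8);  /* lhu, little-endian */
    v = grev32(v, 24);                /* byte-reverse: halfword moves to the top */
    uint32_t hi = v >> 16;            /* srai by 16, written portably below      */
    return (hi & 0x8000u) ? (int32_t)hi - 0x10000 : (int32_t)hi;
}
```

With k = 24 the byte and halfword stages both fire, so the two loaded bytes end up swapped in the upper half of the register and the arithmetic shift supplies the sign extension.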
just going through V0.35-draft, it looks really good, exceptionally
comprehensive. question: 2.8 why return 0 in rd rather than throw an
exception? exception allows software-emulation in a trap, what did i
miss?
spelling mistake "opecodes" p14. patch attached (git am apply),
there's quite a few others.
p22 can i suggest including a reference implementation (c-code) of
64-bit BEXT/BDEP for emulation in a software-trap so that implementors
can make an informed assessment (and we can take a look as well)?
in particular i'd be interested to see how a 64-bit BEXT/BDEP would be
implemented using a combination of 32-bit BEXT/BDEP plus other 64-bit
xBitManip primitives.
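A possible shape for such a reference implementation is sketched below. It is not taken from the draft: it assumes BEXT and BDEP behave like the x86 PEXT and PDEP operations (gather the masked bits into the low end, or scatter low bits into the masked positions), and the function names are hypothetical. The 64-bit versions are built from the 32-bit ones plus a population count and shifts.

```c
#include <assert.h>
#include <stdint.h>

/* Gather the bits of x selected by mask into the low bits of the result. */
uint32_t bext32(uint32_t x, uint32_t mask)
{
    uint32_t r = 0;
    unsigned j = 0;
    for (unsigned i = 0; i < 32; i++) {
        if ((mask >> i) & 1u) {
            if ((x >> i) & 1u) r |= UINT32_C(1) << j;
            j++;
        }
    }
    return r;
}

/* Scatter the low bits of x into the positions selected by mask. */
uint32_t bdep32(uint32_t x, uint32_t mask)
{
    uint32_t r = 0;
    unsigned j = 0;
    for (unsigned i = 0; i < 32; i++) {
        if ((mask >> i) & 1u) {
            if ((x >> j) & 1u) r |= UINT32_C(1) << i;
            j++;
        }
    }
    return r;
}

static unsigned pcnt32(uint32_t x)
{
    unsigned c = 0;
    while (x != 0) { x &= x - 1; c++; }
    return c;
}

/* 64-bit extract: the high half's result sits above the low half's n bits. */
uint64_t bext64(uint64_t x, uint64_t mask)
{
    unsigned n = pcnt32((uint32_t)mask);
    uint32_t lo = bext32((uint32_t)x, (uint32_t)mask);
    uint32_t hi = bext32((uint32_t)(x >> 32), (uint32_t)(mask >> 32));
    assert(n == 32 || (lo >> n) == 0);   /* lo occupies at most n bits */
    return (uint64_t)lo | ((uint64_t)hi << n);
}

/* 64-bit deposit: the high half consumes the source bits after the first n. */
uint64_t bdep64(uint64_t x, uint64_t mask)
{
    unsigned n = pcnt32((uint32_t)mask);
    uint32_t lo = bdep32((uint32_t)x, (uint32_t)mask);
    uint32_t hi = bdep32((uint32_t)(x >> n), (uint32_t)(mask >> 32));
    return (uint64_t)lo | ((uint64_t)hi << 32);
}
```

The bit-serial loops are meant for clarity in a trap handler or test bench, not speed; the 64-bit split costs two 32-bit operations, one population count, and a shift plus an OR.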
Ah, I couldn’t figure out how you did it in so few gates until I looked at the code and saw it was for RV32 instead of RV64. In any case, the amount of logic is trivial; it’s the wires that cost area (across the datapath).
I thought you were proposing shuffle/unshuffle as a replacement for grev, not in addition to it. I think I see how you were implementing it: by passing the mask down at each stage and shuffling it along with the data, and using it to enable the mux select (so the mask would be forced to all-1s for grev, when mux selects can have any value, and be allowed to be set to something else when mux selects are limited to a single bit).