This was originally intended to be "reply-all". I accidentally
originally sent this to just Allen by mistake. My apologies for the
spam, Allen.
On Sun, Oct 29, 2017 at 4:22 PM, Allen J. Baum
<allen...@esperantotech.com> wrote:
> generally you'll have custom HW that will do exactly those permutations (e.g. crypto apps).
Forgive me if I'm stating the obvious. I'm responding only because I
don't want to see RISC-V's official ISA pigeon-holed.
I would like to see feature parity with the like of POWER-architecture
processors, at least. Any bit-manipulation instruction set extension
should, at the very least, facilitate these jobs:
1) RISC-V emulator instruction decode performance,
2) 2D graphics manipulation (e.g., bit-blits).
3) Constant-time cryptography.
Security is something of a "big thing" these days; without suggesting
we cater specifically to that domain, I also think it's short-sighted
to design an instruction set that makes this more difficult than
necessary. Despite modern-day x86 and ARM processors having
instructions for specific classes of cryptography, they obviously
aren't universal, as blog posts about how certain ways of writing
software results in non-constant-time operations. Also, new crypto
algorithms are being researched all the time, so baking some
instructions for AES today, and Elliptical curve tomorrow, etc. will
result in vastly more ISA pollution than just biting the bullet and
adding support for generalized bit-manipulation instructions today.
Further, with respect to depending on the presence of custom HW, this
assumes you're always using RISC-V for embedded applications. This is
an invalid assumption: *I* intend on using them as general purpose
CPUs for desktop-class computers, for instance. I know, I'm in the
minority. But, still, I'm an existence proof.
>>Synthetic biology is what immediately comes to mind (nucleotide sequences require radix=4) but there are likely many potential future applications·
>
> Again: if this a really important app, then a synthetic biology coprocessor would make more sense.
One of the earlier RISC-V Workshops I've attended had about *half* of
its attendance from engineers looking at the possibility of using
RISC-V processors for supercomputer construction. They covered
everything from large-scale database engines to scientific computing
clusters the size of buildings. It is foolish to say, "If this is a
really important app." I don't think it's for you to decide. The ISA
should be as orthogonal as possible; do not prematurely optimize. You
can never tell who is going to want to use RISC-V going forward.
Just my $0.02.
--
Samuel A. Falvo II
Hi Folks,
Bit manipulation instructions have been on my mind for a while as I am interesting in creating some encodings so that we can add candidate instructions to a branch of GCC so that we can start looking at stats such as dynamic retired instructions or cycle counts. I’m not yet a member of the BitManip Group in the riscv.org workgroup system so I’m posting here to ISA dev. I haven’t seen any public updates from the BitManip group, and was expecting there would be some open discussion on ISA-dev (a la IETF).
[snip]
## Rotate
RLL[W] rd,rs1,rs2 Rotate Left Logical Rotate bits in rs1 left by the amount in rs2
RRL[W] rd,rs1,rs2 Rotate Right Logical Rotate bits in rs1 right by the amount in rs2
RLLI[W] rd,rs1,shamt Rotate Left Logical Immediate Rotate bits in rs1 left by the immediate
RRLI[W] rd,rs1,shamt Rotate Right Logical Immediate Rotate bits in rs1 right by the immediate
Rotate instructions can be implemented very easily with the addition of carry-out carry-in logic to an existing shifter. Rotates are very simple instructions but they are frequently used in cryptographic ciphers so the small saving in cycles (3:1) is likely worth the additional area for processors that implement the B extension.
## Byte swap
BSWAP[W] rd,rs1 Byte Swap Swap byte order in rs1
Byte swapping instructions are essential for networking code and cryptographic ciphers which typically use big endian formats. A 32-bit byte swap takes ~14 instructions and a 64-bit byte swap takes ~30 instructions.
## Count Leading and Trailing Zeros
BCLZ[W] rd,rs1 Bit Count Leading Zeros Count leading zero bits in rs1
BCTZ[W] rd,rs1 Bit Count Trailing Zeros Count trailing zero bits in rs1
BCNT[W] rd,rs1 Bit Count Count number of bits set in rs1
BREV[W] rd,rs1 Bit Reverse Reverse bits in rs1
Count leading and trailing zeros can be constructed using popcount (BCNT) and bit reverse (BREV) instructions or functional units. Given the length of the instruction sequences and the possibility to multiplex via shared bit reverse and popcount functional units, the count leading and trailing zeros instructions are included.
## Parallel Bit Extract and Deposit
BEXT[W] rd,rs1,rs2 Bit Extract Gather LSB justified bits to rd from rs1 using extract mask in rs2
BDEP[W] rd,rs1,rs2 Bit Deposit Scatter LSB justified bits from rs2 to rd using deposit mask in rs2
Bit Extract
Gather LSB justified bits to rd from rs1 using extract mask in rs2. Bits are extracted from the source register using bit positions enabled in the mask register and are placed in the result as popcount(mask) LSB contiguous bits.
Register Value
source 0b11110100
mask 0b01100011
result 0b00001100
Bit Deposit
Scatter LSB justified bits from rs2 to rd using deposit mask in rs2. Bits are deposited from the source register using popcount(mask) LSB contiguous bits and are placed in the result at the bit positions enabled in the mask register.
Register Value
source 0b11110100
mask 0b01100011
result 0b00100000
AND-NOT and OR-NOT have been suggested, and could be easily encoded near
the current AND and OR, but would not have immediate forms at all due to
lack of encoding space. This would break an elegant symmetry in RISC-V.
I omit them for this reason.
-- Jacob
[snip]
I maintain that W versions are required for all of them not only for consistency but to avoid additional canonicalisation e.g. ZEXT.W. Why skip W on just CTZ when the results tangibly differ depending on the content of the upper bits for all of the instructions. RISC doesn’t mean make extra work due to extreme minimalism. It means choosing the instructions emitted by the compiler. Unfortunately there is a lot of code that uses int32_t and uint32_t. We can’t depend on programmers changing their code to use variable width types. I’d say there would be a lot of 32-bit code that uses __builtin_popcount/__builtin_clz/__builtin_ctz and uint32_t vs __builtin_popcountll/__builtin_clzll/__builtin_ctzll and uint64. Having only 32-bit versions is essentially saying it needs to be more than one instruction. The compilers currently emit canonicalisation instructions when a 64-bit instruction is used on a 32-bit value.
I won’t go as far as saying that we should have ANDW/ORW/XORW/NEGW, but if we would save some engineering on REE (Redundant Extension Elimination). The state of the art in open source compilers are not aware that these instructions don’t affect canonicalisation of the upper bits.
The question is how many programs would actually use CTZW/CLZW?
input. To me, this seems like the most logical place to add them would
therefore be SLT/SLTU.
I am not dead set on this if someone can explain
how those operations make more sense grouped with other hardware.
For
example, if POPCNT can be more easily implemented by adding a few gates
to an adder than by modifying the comparator used by SLT or adding a new
functional block, I would support encoding POPCNT near ADD.
> GREVI:
>
>>> The question is how often will GREVI be used where k is not 31 or 24.
>>>
>> potentially a lot if you do things like arbitrary bit permutations:
>> http://nbviewer.jupyter.org/url/svn.clifford.at/handicraft/2017/permsyn/data.ipynb
>>
Arbitrary permutations may not be high on the list, but I like the idea
of fewer, more flexible instructions -- GREV gives us both BSWAP and
BREV as special cases and all other possible permutations, while
conveniently fitting into a corner of the opcode space that is otherwise
very hard to use.
-- Jacob
Arbitrary permutations may not be high on the list, but I like the idea
of fewer, more flexible instructions -- GREV gives us both BSWAP and
BREV as special cases and all other possible permutations, while
conveniently fitting into a corner of the opcode space that is otherwise
very hard to use.Which bit permutations does GREVI actually give. Clearly not all 64 factorial ones.
On Mon, Oct 30, 2017 at 7:26 AM, Rogier Brussee
<rogier....@gmail.com> wrote:
>>
>> AND-NOT and OR-NOT have been suggested, and could be easily encoded near
> There is no SUBI either, and for the same reason: just like you can do
> an ANDI with a negated immediate, you can do an AND with a complemented
> immediate
> if you need a ANDNOTI
AND-NOT (NAND) is not the same as inverting one input to an AND operation.
.....
Which bit permutations does GREVI actually give. Clearly not all 64 factorial ones.
It doesn't give all 64 factorial permutations in a single instruction, clearly.
Seems to me it's more like a "generalized rotate" instruction. A definition is at https://groups.google.com/a/groups.riscv.org/forum/#!msg/isa-dev/PeTwBaKuC9w/E6kKdcs_AwAJ
#ifndef softfloat_countLeadingZeros32 | |
/*---------------------------------------------------------------------------- | |
| Returns the number of leading 0 bits before the most-significant 1 bit of | |
| 'a'. If 'a' is zero, 32 is returned. | |
*----------------------------------------------------------------------------*/ | |
#if defined INLINE_LEVEL && (3 <= INLINE_LEVEL) | |
INLINE uint_fast8_t softfloat_countLeadingZeros32( uint32_t a ) | |
{ | |
uint_fast8_t count = 0; | |
if ( a < 0x10000 ) { | |
count = 16; | |
a <<= 16; | |
} | |
if ( a < 0x1000000 ) { | |
count += 8; | |
a <<= 8; | |
} | |
count += softfloat_countLeadingZeros8[a>>24]; | |
return count; | |
} | |
#else | |
uint_fast8_t softfloat_countLeadingZeros32( uint32_t a ); | |
#endif | |
#endif | |
-- Jacob
> I still like BGATHER, BSCATTER which actually makes sense
The names in the literature are Parallel Extract and Parallel Deposit however I agree Bit Gather and Bit Scatter are more intuitive names, but unfortunately the opcodes are very long and will much up tabs.
> I don't see why one would need both a left and right rotate. E.g. RRLI
> rd rs1 shamt gives exactly the same result as RLLI rd rs1 (XLEN - shamt)
For symmetry with the shift instructions, where both directions are
needed, although I would drop the trailing "L" from the mnemonics --
there is no "arithmetic rotate".
>
> ## Byte swap
>
> BSWAP[W] rd,rs1 Byte
> Swap Swap byte order in rs1
>
> Byte swapping instructions are essential for networking code and
> cryptographic ciphers which typically use big endian formats. A
> 32-bit byte swap takes ~14 instructions and a 64-bit byte swap
> takes ~30 instructions.
>
>
>
> Even BSWAPH might be worse it as it is used so extensively in
> networking code. I guess that could be a GREVI mnemonic too.
This is why I favor GREVI over individual swap/reverse instructions.
-- Jacob
(2) This is the B extension. By definition, the features added by it are
aimed at bit manipulation code and will not be benefiting other code a lot.
So discussing what general non-bitmanip code needs is a strawman.
(3) You need log2(XLEN) bits of parameters for a general permutation
operation.
On Tue, Oct 31, 2017 at 3:23 PM, Clifford Wolf <clif...@clifford.at> wrote:(2) This is the B extension. By definition, the features added by it are
aimed at bit manipulation code and will not be benefiting other code a lot.
So discussing what general non-bitmanip code needs is a strawman.They should still be useful on most high volume general purpose computing devices.
Things useful for, say, secure communications over networks (while minimising power consumption) would I suggest fall under the bar.
Hi,
On Tue, Oct 31, 2017 at 04:22:16PM +0300, Bruce Hoult wrote:
> > (2) This is the B extension. By definition, the features added by it are
> > aimed at bit manipulation code and will not be benefiting other code a lot.
> > So discussing what general non-bitmanip code needs is a strawman.
>
> They should still be useful on most high volume general purpose computing
> devices. Things useful for, say, secure communications over networks (while
> minimising power consumption) would I suggest fall under the bar. Things
> only useful for simulating nuclear weapons or analysing data from medical
> or geological instruments might not -- those guys can afford to make custom
> processors.
I think you are mixing things here that really have nothing to do with each
other, namely bitmanip/non-bitmanip code (which is what I was talking about)
and concrete application domains that might make use of bitmanip code.
Some bitmanip operations might have applications in "simulating nuclear
weapons or analysing data from medical or geological instruments". But that
does not mean that they are only useful in that one domain!
And I seriously
doubt that you will find any opcode in the current B draft spec that can be
connected to just one of those application domains.
I was talking about bit-manipulation vs. non bit-manipulation code. Some
applications domains have more bit-manipulation code than others. If you
build HW specifically for a domain that need bit-manipulation code then
just don't implement the B extension.
So your point is a red herring and has nothing to do with the issues being
discussed.
If you disagree then you should actually name the specific instruction from
the draft B spec that you think is only useful in a narrow application
domain. Then we can discuss that specific claim. Otherwise this is just
irrelevant noise.
>> Synthetic biology is what immediately comes to mind (nucleotide sequences require radix=4) but there are likely many potential future applications·
>
> Again: if this a really important app, then a synthetic biology coprocessor would make more sense.
Not how most of the high performance computing is done today.. In most
cases one will have to make do with what commodity processors provide.
regards,
- clifford
(2) This is the B extension. By definition, the features added by it are
aimed at bit manipulation code and will not be benefiting other code a lot.
So discussing what general non-bitmanip code needs is a strawman.
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/711EE4CA-32EB-4CBB-B8BE-4935E0DB2F35%40mac.com.
Everybody take a deep breath…. Good…. I have been traveling on business out of the country, so I am behind on getting the next Draft of the spec out. It will not take into account all the discussion the last four days, but the next one will in time for the workshop.
--Rex
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/711EE4CA-32EB-4CBB-B8BE-4935E0DB2F35%40mac.com.
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at
https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAPZiS_68G0vEULHxGqbK69dLD2PERHjN%3DqMcR1W-7TvPsrtQ%2Bw%40mail.gmail.com.
TL;DR: quantify quantify quantify
for a preference). Some random examples: Linux kernel, SPEC benchmark,ssh, OpenSSL, "TLS 1.3", x264, Mozilla Firefox, etc. These might not bethe best examples, but they are examples of software for which thenumber of cycles spent over the number of users is significant enough.
On Tue, Oct 31, 2017 at 12:39 PM, Michael Clark <michae...@mac.com> wrote:
> On 1/11/2017, at 8:34 AM, Michael Clark <michae...@mac.com> wrote:
>
>> On 1/11/2017, at 3:39 AM, Rogier Brussee <rogier....@gmail.com> wrote:
>>
>> BCTZ[W] rd,rs1 ???? BREV rd rs1; BCLZ rd, rd
>
> Count Trailing Zeros is used frequently for a fast log2 of an integer. I come across it relatively often.
Log2 of the lowest set bit to be more precise or used under the constraint that the input is a power of two, but you get what I mean. e.g.
assert(n == (n & -n)); /* is power of two */
int log2n = ctz(n);
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
A barrel shifter is by nature xlen^2 in area and xlen in depth.
No - you didn't count the wires. It's quadratic even in a log shifter.
>>>>> On Thu, 2 Nov 2017 21:47:22 +0300, Bruce Hoult <br...@hoult.org> said:
| On Thu, Nov 2, 2017 at 8:37 PM, Michael Clark <michae...@mac.com> wrote:
| A barrel shifter is by nature xlen^2 in area and xlen in depth.
| Wait, what?
| A shift/rotate unit with constant delay is normally done by a stage that can
| optionally shift/rotate by 1 position, then a stage that can shift by 2 bits,
| then a stage that can shift by 4 bits, then stages that can shift by 8, 16, 32,
| 64 bits...
| Each stage is directly controlled (pass through unchange, or shift) by one bit
| from the shift amount.
| The delay is proportional to log(xlen) and the area to xlen*log(xlen).
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAMU%2BEky698b2BqXodyLsgtu0ZpR9HCkvyq4dSvBAm2Hfccf_kA%40mail.gmail.com.
Op dinsdag 31 oktober 2017 14:22:18 UTC+1 schreef Bruce Hoult:
On Tue, Oct 31, 2017 at 3:23 PM, Clifford Wolf <clif...@clifford.at> wrote:(2) This is the B extension. By definition, the features added by it are
aimed at bit manipulation code and will not be benefiting other code a lot.
So discussing what general non-bitmanip code needs is a strawman.They should still be useful on most high volume general purpose computing devices.Things useful for, say, secure communications over networks (while minimising power consumption) would I suggest fall under the bar.
F.W.I.W. I mostly agree but the problem is of course that everybody thinks its own applications are super important.I would also add that things that are common in embedded use would seem reasonable. Last reason I can think of is that it opensup useful opportunities for fusing instructions and the instructions themselves are cheap.So I think it would be really useful to have a list of important applications going with the instructions and alternatives to cater forthese instructions! Here is my attempt using my memory of what I have seen pass by and happen to know. YMMV.
Opcode application(s) Alternatives
RLL[W] rd,rs1,rs2 Cryptography, fusion
RRL[W] rd,rs1,rs2 same ADDI rd, rs2, -XLEN; RRL[w] rd, rs1, rd
RLLI[W] rd,rs1,shamt same
RRLI[W] rd,rs1,shamt same RLLI rd, rs1, (XLEN -shamt)
BCLZ[W] rd,rs1 softfloat
BCTZ[W] rd,rs1 ???? BREV rd rs1; BCLZ rd, rd
BCNT[W] rd,rs1 malloc, bitmaps
BREV[W] rd,rs1 reverse addresses, bitmaps?? GREVI rd, (XLEN -1)
BSWAP[W] rd,rs1 be <-> le e.g. networking, GREVI rd, (XLEN - 8) (?)
GREVI[W] rd rs1 shamt see BSWAP, BREV bitpermutations BSWAP, BREV instructions for common case, trapping for others
BGATH[W] rd,rs1,rs2 instruction decoding in traps, bitmaps masking and shifting, instructions or csr's that directly "parse" the trapping instruction into (instr, rd, rs1, rs2, rs3, imm/shamt)??????.
BSCAT[W] rd,rs1,rs2 bitmaps masking and shifting.ANDNOT rd rs1 rs2 clear bit, fusion XORI rd, rs2, -1; AND rd, rd, rs1, (possibly extending C extension with not instruction)ORNOT rd rs1 rs2 masking, fusion XORI rd, rs2, -1; OR rd, rd, rs1,SLLM rd rs1 rs2 constructing masks, fusion XORI rd, rs1, -1; SLL rd, rd, rs2; XORI rd, rd, -1SLLMI rd rs1 shamt same XORI rd, rs1, -1; SLLI rd, rd, shamt; XORI rd, rd, -1