Recommend target application for bit manipulation isa

213 views
Skip to first unread message

黃柏瑋

unread,
Feb 4, 2017, 9:12:52 AM2/4/17
to RISC-V ISA Dev
Hi all,
Recently, bit manipulation task group and I are searching for target application of bit manipulation isa for benchmarking.
I’ve spend some time collecting and raise a poll in the RISC-V@Taiwan FB group. ( https://www.facebook.com/groups/riscv.tw/.)
However, I thought that we might need more input, as we need to cover a wide range of application.

Could anyone suggest some good target application for bit manipulation ISA?
I would pass every suggestion into the group and cite your name.

Thanks

Po-wei Huang




Samuel Falvo II

unread,
Feb 4, 2017, 10:31:59 AM2/4/17
to 黃柏瑋, RISC-V ISA Dev
On Sat, Feb 4, 2017 at 6:12 AM, 黃柏瑋 <poweih...@gmail.com> wrote:
> Could anyone suggest some good target application for bit manipulation ISA?

Bit-blitting on monochrome frame buffers and off-screen bitmaps. It's
probably archaic by today's standards, but it's still relatively rare
to find GPUs in small, low-power embedded devices.

--
Samuel A. Falvo II

Corey Richardson

unread,
Feb 4, 2017, 2:56:36 PM2/4/17
to 黃柏瑋, RISC-V ISA Dev
Chess engines like Stockfish do a lot of bit ops.

It's not much of an application, but as microbenchmarks go you might grab some of the examples from and see if they can be accelerated https://graphics.stanford.edu/~seander/bithacks.html


On Sat, Feb 4, 2017, at 09:12, 黃柏瑋 wrote:
> Hi all,
> Recently, bit manipulation task group and I are searching for target application of bit manipulation isa for benchmarking.
> I’ve spend some time collecting and raise a poll in the RISC-V@Taiwan FB group. ( https://www.facebook.com/groups/riscv.tw/ <https://www.facebook.com/groups/riscv.tw/>.)
> However, I thought that we might need more input, as we need to cover a wide range of application.
>
> Could anyone suggest some good target application for bit manipulation ISA?
> I would pass every suggestion into the group and cite your name.
>
> Thanks
>
> Po-wei Huang
>
>
>
>
> --
> You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
> To post to this group, send email to isa...@groups.riscv.org.



Andrew Waterman

unread,
Feb 4, 2017, 4:51:21 PM2/4/17
to 黃柏瑋, RISC-V ISA Dev
I suggest including lots of general-purpose compiled code (e.g.
SPECint), which will discourage anecdotal optimization.

In your analysis, I suggest including the `C' extension, then measure
improvement in dynamic instruction bytes fetched. This will
discourage the temptation to optimize common 2-instruction sequences
into new `B' instructions. Doing so would be a false economy, because
high-performance implementations will pattern-match common `C'
instruction pairs into single internal operations. For example, an
instruction that can zero- or sign-extend from arbitrary bit positions
would not be a great addition, since it maps to C.SLLI + C.SR{A/L}I.

Tommy Thorn's original `B' proposal (POPC + BREV + BSWAP) was
attractive because it only included instructions that are both useful
and expensive to synthesize from `I' instructions.

On Sat, Feb 4, 2017 at 6:12 AM, 黃柏瑋 <poweih...@gmail.com> wrote:

Bryce Lanham

unread,
Feb 4, 2017, 4:58:08 PM2/4/17
to 黃柏瑋, RISC-V ISA Dev, Corey Richardson
LZMA decompression might be another candidate. There's a benchmark here: http://www.7-cpu.com/, but from my reading it's possible that it might be more affected by branch misprediction and multiplier speed than bit manipulation speed.

EEMBC also has a bit manipulation benchmark (https://www.eembc.org/techlit/datasheets/auto_bit.pdf) as part of a larger embedded microprocessor benchmark suite. It requires licensing, but it appears they do have academic licensing available. I also found a review of the suite here: http://accentsjournals.org/paperInfo_1.php?journalPaperId=876&countPaper=52

--
Bryce Lanham

On Sat, Feb 4, 2017 at 1:56 PM, Corey Richardson <co...@octayn.net> wrote:
Chess engines like Stockfish do a lot of bit ops.

It's not much of an application, but as microbenchmarks go you might grab some of the examples from and see if they can be accelerated https://graphics.stanford.edu/~seander/bithacks.html


On Sat, Feb 4, 2017, at 09:12, 黃柏瑋 wrote:
> Hi all,
> Recently, bit manipulation task group and I are searching for target application of bit manipulation isa for benchmarking.
> I’ve spend some time collecting and raise a poll in the RISC-V@Taiwan FB group. ( https://www.facebook.com/groups/riscv.tw/ <https://www.facebook.com/groups/riscv.tw/>.)
> However, I thought that we might need more input, as we need to cover a wide range of application.
>
> Could anyone suggest some good target application for bit manipulation ISA?
> I would pass every suggestion into the group and cite your name.
>
> Thanks
>
> Po-wei Huang
>
>
>
>
> --
> You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.

To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

Fabjan Sukalia

unread,
Feb 4, 2017, 5:24:48 PM2/4/17
to 黃柏瑋, RISC-V ISA Dev
While writing an emulator for RISC-V I noticed that the UJ and SB instruction formats need 
many masking and shifting operations. Other ISAs might also have instructions that need 
a lot of bit manipulation for decoding or encoding. 

--
Fabjan Sukalia

Michael Clark

unread,
Feb 4, 2017, 7:19:08 PM2/4/17
to 黃柏瑋, RISC-V ISA Dev
One method to find applications is to use a code search tool and search for code that uses some of the pre-existing bit manip compiler intrinsics.

- Byte Swap
- __builtin_bswap32
- __builtin_bswap64
- Count Bits aka popcnt
- __builtin_popcount (32-bit)
- __builtin_popcountll (64-bit)
- Count Leading Zero Bits (result is undefined for zero but most implementations return the word width)
- __builtin_clz (32-bit)
- __builtin_clzll (64-bit)
- Count Trailing Zero Bits (result is undefined for zero but most implementations return the word width)
- __builtin_ctz (32-bit)
- __builtin_clzll (64-bit)
- Find First Set (nearly equivalent to count trailing zeros but returns zero if the argument is zero)
- __builtin_ffs (32-bit)
- __builtin_ffsll (64-bit)

See 
https://en.wikipedia.org/wiki/Find_first_set


The Bit Manip group started making a list of potential benchmark applications:

- GCC - Decoding and encoding of RISC-V instructions (especially RVC) could benefit from fast field extract and deposit

- C Standard Library - gdtoa (decimal to ASCII) in various C standard libraries (printf/sprintf) use count leading and trailing zeros typically via compiler intrinsics / builtins.

- OpenSSL - ciphers and digest algorithms frequently use the rotate primitives
Rotate lacks a compiler intrinsic but can be detected by compilers - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62263

- GMP and other big integer or multiple precision math library codes tend to use count leading zeros for exponent re-normalisation.

- OpenCV and other image manipulation and OCR libraries appear to use popcnt for quantisation in bitmaps.

- BLAST (Basic Linear Alignment Search Tool) and other DNA search tools can use parallel extract and parallel deposit. DNA search tools are dealing with combinatorial searches of large compressed bit strings made up of DNA codons (’T’, ’C’, ‘A’ and G’)


There is a question as to whether bswap16 is worthwhile due to the 3 instruction sequence (two shifts plus an ‘or') for swapping 2 bytes however it appears the compiler emits quite a lot of code, mostly zero extension of the short argument which is presumable already a short, followed by sign extension.



$ cat a.c 
short bswap16(short x)
{
        return __builtin_bswap16(x);
}

int main()
{
        bswap16(65535);
}
$ riscv64-unknown-elf-gcc -O3 -S a.c
$ cat a.s 
.file "a.c"
.option nopic
.text
.align 2
.globl bswap16
.type bswap16, @function
bswap16:
sllw a5,a0,16
srlw a5,a5,16
srlw a5,a5,8
sllw a0,a0,8
or a0,a0,a5
sllw a0,a0,16
sraw a0,a0,16
ret
.size bswap16, .-bswap16
.section .text.startup,"ax",@progbits
.align 2
.globl main
.type main, @function
main:
li a0,0
ret
.size main, .-main
.ident "GCC: (GNU) 6.1.0"

Andrew Waterman

unread,
Feb 4, 2017, 7:51:58 PM2/4/17
to Michael Clark, 黃柏瑋, RISC-V ISA Dev
That code generation is bad, but it's a distraction; better just to
have BSWAP, which operates on XLEN/8 bytes, and can swap a smaller
number of bytes with an additional right shift.
> --
> You received this message because you are subscribed to the Google Groups
> "RISC-V ISA Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to isa-dev+u...@groups.riscv.org.
> To post to this group, send email to isa...@groups.riscv.org.
> Visit this group at
> https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
> To view this discussion on the web visit
> https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/FE726325-BDE4-4108-B350-39CA4B828F66%40mac.com.

Michael Clark

unread,
Feb 4, 2017, 8:03:56 PM2/4/17
to Andrew Waterman, 黃柏瑋, RISC-V ISA Dev
I thought about that after seeing your email.

Yes, it makes a lot of sense to have a single BSWAP instruction and to combine it with a shift.

bswap16 and bswap32 will be popular on RV64. TCP and TLS uses bigendian (network byte order) shorts a lot, but 2 instructions is not a big cost.

Allen J. Baum

unread,
Feb 4, 2017, 8:09:05 PM2/4/17
to Bryce Lanham, ¿îêÞ„, RISC-V ISA Dev, Corey Richardson
So, a couple of weeks ago there was a Bit Mainpulation Task Force
meeting online, with a link to join.

That link didn't work for me, and for at least one other person.

I was told that there would be a writeup of the results, but so far nothing.

Did the meeting actually happen?
Could anyone who "attended" that meeting describe what was discussed
or decided?
Is there some secret mailing list I should subscribe to?


--
**************************************************
* Allen Baum tel. (908)BIT-BAUM *
* 248-2286 *
**************************************************

Jacob Bachmeyer

unread,
Feb 4, 2017, 8:53:29 PM2/4/17
to Andrew Waterman, 黃柏瑋, RISC-V ISA Dev
Andrew Waterman wrote:
> Tommy Thorn's original `B' proposal (POPC + BREV + BSWAP) was
> attractive because it only included instructions that are both useful
> and expensive to synthesize from `I' instructions.
>
[apologies in advance for the stream-of-consciousness style; the most
important bit is after the "SLM" heading]

How do EXTRACT-FIELD and WRITE-FIELD fit into this? On one hand, these
can be synthesized from AND/OR and shifts, but on the other hand,
synthesizing them requires forming a mask in a register, which requires
multiple steps.

EXTRACT-FIELD(source, offset, count) :=
((source>>offset)&((1<<(count+1))-1))

WRITE-FIELD(source, offset, count, value) :=
(((~(((1<<(count+1))-1)<<offset))&source)|((value&((1<<(count+1))-1))<<offset))

A 12-bit immediate is exactly enough to encode 6-bit offset/count for up
to RV64I, but that would not help RV128, nor would it help the dynamic
case, which is exactly where these instructions are most useful, since
the complicated expressions to form masks can be evaluated at
compile-time if the offset and count are known. However, WRITE-FIELD is
fundamentally a 5-operand instruction; while offset and count can be
combined into a control word, that only reduces it to a 4-operand
instruction (dest <- WRITE-FIELD(source, offset/count, value)).

Setting the inability to actually encode WRITE-FIELD aside for a moment,
consider encoding EXTRACT-FIELD. It is an R-type instruction, using a
32-bit control word in a register. To improve performance in the common
case and allow the control word to be loaded using a single instruction,
I propose permuting the control word bits and assigning bits 0:5 as
count[0:5] and bits 6:11 as offset[0:5], with higher bits either ignored
or must-be-zero on RV32/RV64 and bits 12:21 as count[6:15] and bits
22:31 as offset[6:15]. This assumes that most bitfields will be less
than 32 bits long, but may appear anywhere in a 64-bit word. In the
static case where these assumptions are met, the control word can be
loaded with a single ADDI and sign extension will fill the top bits with
zero. In the dynamic case on RV32/RV64, the control word can be formed
with a SHIFT/OR sequence, which could be a candidate for macro-op
fusion. The process is a bit more complex for RV128, but EXTRACT-FIELD
can itself be used to split the count and offset inputs, with SHIFT/OR
to build the control word.

This leads to a new operation that permits WRITE-FIELD to be expressed
as a short sequence:

MASK-FIELD(source, offset, count) :=
((~(((1<<(count+1))-1)<<offset))&source)

WRITE-FIELD(source, offset, count, value) = (MASK-FIELD(source, offset,
count)|(EXTRACT-FIELD(value, 0, count)<<offset))

==SLM==

This could still be a bit much to ask and there is a single instruction
that could be added that gives nearly all of the benefit of MASK-FIELD
and EXTRACT-FIELD. I propose SLM and SLMI ("shift-left-mask" and
"shift-left-mask-immediate"). These are left-shift instructions that
shift in ones instead of zeros. The entire ((1<<(count+1))-1)
expression can be expressed as "SLM[I] t0, x0, count". These can be
encoded with new function codes analogous to the way right arithmetic
shift is encoded and right-shift variants can be similarly encoded. The
additional hardware burden could be as little as a single wire from the
instruction decoder to the ALU, replacing the hardwired
"shift-in-zero". There is plenty of room in funct7 under the shift
minor opcodes. I suggest using bit 29 of the instruction word to
distinguish "shift-in-zero" and "shift-in-one".

These shifts also avoid the control word tomfoolery with EXTRACT-FIELD
and MASK-FIELD, which left a bit of a bad taste in my mouth. The new
shifts may also be useful for other applications, while EXTRACT-FIELD
and MASK-FIELD are rather specialized.

In fact, SLM allows the very rapid construction of simple masks, even in
RV64, where loading a 64-bit value is much more expensive. With SLMI:

MASK-FIELD:
SLMI t0, x0, count
SLLI t0, t0, offset
AND t0, t0, source

EXTRACT-FIELD:
SLMI t0, x0, count
SRLI t1, source, offset
AND t0, t0, t1

WRITE-FIELD:
SLMI t0, x0, count
SLLI t1, t0, offset
NOT t1, t1
AND t0, t0, value
SLLI t0, t0, offset
AND t1, t1, source
OR t0, t0, t1

-- Jacob

Michael Clark

unread,
Feb 4, 2017, 8:53:47 PM2/4/17
to Andrew Waterman, 黃柏瑋, RISC-V ISA Dev
On 5 Feb 2017, at 1:51 PM, Andrew Waterman <and...@sifive.com> wrote:


It would be possible to do the same for POPCOUNT. one right shift before calling it.

Count Leading Zeros requires zero extension (zext.w: left shift, right shift) and a subtract, for the XLEN width approach due to sign extension being the normal form (assuming the defined behaviour for zero is to return the word width in bits, which seems to be most common behaviour). 4 instructions. CLZ may benefit from a 32-bit version on RV64. CLZ is more common than CTZ.

Count Trailing Zeros requires left shift and subtract, for the XLEN width approach. 

Alex Elsayed

unread,
Feb 4, 2017, 9:21:27 PM2/4/17
to isa...@groups.riscv.org

On Saturday, 4 February 2017 19:53:26 PST Jacob Bachmeyer wrote:

> Andrew Waterman wrote:

> > Tommy Thorn's original `B' proposal (POPC + BREV + BSWAP) was

> > attractive because it only included instructions that are both useful

> > and expensive to synthesize from `I' instructions.

>

> [apologies in advance for the stream-of-consciousness style; the most

> important bit is after the "SLM" heading]

>

> How do EXTRACT-FIELD and WRITE-FIELD fit into this? On one hand, these

> can be synthesized from AND/OR and shifts, but on the other hand,

> synthesizing them requires forming a mask in a register, which requires

> multiple steps.

>

> EXTRACT-FIELD(source, offset, count) :=

> ((source>>offset)&((1<<(count+1))-1))

>

> WRITE-FIELD(source, offset, count, value) :=

> (((~(((1<<(count+1))-1)<<offset))&source)|((value&((1<<(count+1))-1))<<offse

> t))

 

EXTRACT-FIELD seems to me to be a very inefficiently encoded bit extraction instruction.

 

It seems much more effective to me to have EXTRACT(source, mask), which "pulls" the bits under the ones in the mask to "one side" of the output, while the bits under zeroes go to the other side. Bit ordering is maintained within the two groups.

 

This permits extracting multiple fields simultaneously, so long as they are already in the correct _order_.

 

If they are not, a second pass will pull a subset of _those_ fields to the side, and so forth.

 

An example, with 8-bit registers for illustration:

 

__________12345678____12345678_______368:12457

EXTRACT(0b10101010, 0b00100101) -> 0b100:10011

 

A DEPOSIT(source, mask) which is exactly the inverse seems to be a natural addition to this.

 

In addition, the masks are _very_ likely to be constants.

 

As a result, these _generalize_ EXTRACT-FIELD and WRITE-FIELD, match existing instructions on other architectures better (PDEP and PEXT, from Intel's BMI2), and don't require three logical arguments (which necessitates the encoding proposed below)

 

As a result, the SLM instructions aren't needed (what they're used to implement is just a trivial case of a more powerful instruction), without encoding weirdness.

signature.asc

Michael Clark

unread,
Feb 4, 2017, 9:32:18 PM2/4/17
to Alex Elsayed, isa...@groups.riscv.org
The advantage of SLM is that the control word can be expressed as an immediate and it is very simple.

Parallel Extract (Bit Gather) and Parallel Deposit (Bit Scatter) are powerful intrinsics but they require an XLEN width control word, which means code that uses them needs to load the control word from a constant section (auipc+ld).

Parallel Extract (Bit Gather) and Parallel Deposit (Bit Scatter) are certainly more flexible intrinsics although they likely have high circuit complexity.

> A 12-bit immediate is exactly enough to encode 6-bit offset/count for up
> to RV64I, but that would not help RV128, nor would it help the dynamic
> case, which is exactly where these instructions are most useful, since
> the complicated expressions to form masks can be evaluated at
> compile-time if the offset and count are known. However, WRITE-FIELD is
> fundamentally a 5-operand instruction; while offset and count can be
> combined into a control word, that only reduces it to a 4-operand
> instruction (dest <- WRITE-FIELD(source, offset/count, value)).
>
> Setting the inability to actually encode WRITE-FIELD aside for a moment,
> consider encoding EXTRACT-FIELD. It is an R-type instruction, using a
> 32-bit control word in a register. To improve performance in the common
> case and allow the control word to be loaded using a single instruction,
> I propose permuting the control word bits and assigning bits 0:5 as
> count[0:5] and bits 6:11 as offset[0:5], with higher bits either ignored
> or must-be-zero on RV32/RV64 and bits 12:21 as count[6:15] and bits
> 22:31 as offset[6:15]. This assumes that most bitfields will be less
> than 32 bits long, but may appear anywhere in a 64-bit word. In the
> static case where these assumptions are met, the control word can be
> loaded with a single ADDI and sign extension will fill the top bits with
> zero. In the dynamic case on RV32/RV64, the control word can be formed
> with a SHIFT/OR sequence, which could be a candidate for macro-op
> fusion. The process is a bit more complex for RV128, but EXTRACT-FIELD
> can itself be used to split the count and offset inputs, with SHIFT/OR
> to build the control word.

-- 
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

黃柏瑋

unread,
Feb 4, 2017, 9:49:06 PM2/4/17
to Allen J. Baum, Bryce Lanham, RISC-V ISA Dev, Corey Richardson, Rex McCrary
Hi Allen,
I also missed the first meeting, but I joined the group later.
At last meeting, they discussed principle of selection, target application, and ways for team to collaborate, as I found on the note.

For any foundation member who wants to join, we’re going to have second meeting at Mon, Feb 6 2017 10:30 am to 12:00 am EST.
You can use the workspace to join the group. 

If it is too hurry for you to join, I probably could send you the link and document, but I have to ask the group leader first.

Hope this help.
Feel free to contact me.

best,

Po-wei Huang

Jacob Bachmeyer

unread,
Feb 4, 2017, 10:01:49 PM2/4/17
to Alex Elsayed, isa...@groups.riscv.org
These EXTRACT and DEPOSIT operations are interesting, but I see two
possible problems:

(1) Unlike EXTRACT-FIELD and WRITE-FIELD, which are essentially the
Common Lisp DPB and LDB functions, which themselves were machine
instructions on some ancient architecture, your EXTRACT and DEPOSIT
match recently added x86 instructions. Are you sure there are no
patents that could raise issues?

(2) The additional hardware burden to support SLM/SLMI (and a
corresponding SRM/SRMI "for free") is tiny--as little as a single wire
from instruction decode to ALU with no additional gates. The additional
hardware required to implement EXTRACT and DEPOSIT is significant.

That said, the new shifts are still useful even with your EXTRACT and
DEPOSIT, since they aid in quickly constructing masks in the dynamic
case, such as LZMA decoding, where the sizes of various bit fields are
parameters.


-- Jacob

Allen Baum

unread,
Feb 4, 2017, 10:04:12 PM2/4/17
to 黃柏瑋, Bryce Lanham, RISC-V ISA Dev, Corey Richardson, Rex McCrary
I am a foundation member (through my company) and havebeen  to the last two workshops. I'm not sure what the workspace link is; if its the same link as last meeting, that didn't work. I'll try to log in and see what I can find.

黃柏瑋

unread,
Feb 4, 2017, 10:10:03 PM2/4/17
to Allen Baum, Bryce Lanham, RISC-V ISA Dev, Corey Richardson, Rex McCrary
Hi Allen,
Thanks for your reply.
yes, I knew that you are a foundation member. The workspace is in the RISC-V foundation. 
Just log in,  click the bitmanipulation workgroup ,click join group and then Rex will get your message. The note and ppt are in there.
Hope this help.
Thanks
Po-wei Huang

Jacob Bachmeyer

unread,
Feb 4, 2017, 10:11:52 PM2/4/17
to Michael Clark, 黃柏瑋, RISC-V ISA Dev
Michael Clark wrote:
> - OpenSSL - ciphers and digest algorithms frequently use the rotate
> primitives
> Rotate lacks a compiler intrinsic but can be detected by compilers
> - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62263

Rotate can also be implemented with SHIFT/SHIFT/OR using one scratch
register. Add a fourth opcode that clobbers the scratch register and
you have a plausible candidate for macro-op fusion.

ROTLI:
SRLI t0, source, XLEN-count
SLLI dest, source, count
OR dest, dest, t0
??? t0, ...

The trailing ??? indicates that the following instruction must clobber
t0 without referencing t0, permitting the SRLI/SLLI/OR sequence to be
executed with a single regfile write to dest. This formulation allows a
rotate operation to immediately follow another.

-- Jacob

黃柏瑋

unread,
Feb 4, 2017, 10:28:59 PM2/4/17
to jcb6...@gmail.com, Michael Clark, RISC-V ISA Dev
Hi all,
I found the multithread kind of mix up. However, thanks for everyone who reply for the target application.
I will bring it to the group and cite all of your name.
Moreover, thanks for andrew’s reminder, I will run both conventional benchmark and specific application to provide the data for decision.
Best,
Po-wei Huang

Michael Clark

unread,
Feb 4, 2017, 10:30:38 PM2/4/17
to jcb6...@gmail.com, Alex Elsayed, isa...@groups.riscv.org
Yes, they could indeed be patented. It would require a search. According to the Princeton Papers (*1,*2), these instructions were present on HP PA-RISC perhaps a decade ago. I also note there is a reference to a Bell Paper from 1964 (*3) which may outlines how these circuits work. It would need some research.


(2)  The additional hardware burden to support SLM/SLMI (and a corresponding SRM/SRMI "for free") is tiny--as little as a single wire from instruction decode to ALU with no additional gates.  The additional hardware required to implement EXTRACT and DEPOSIT is significant.

That said, the new shifts are still useful even with your EXTRACT and DEPOSIT, since they aid in quickly constructing masks in the dynamic case, such as LZMA decoding, where the sizes of various bit fields are parameters.


-- Jacob

-- 
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

Bruce Hoult

unread,
Feb 5, 2017, 5:43:15 AM2/5/17
to Jacob Bachmeyer, Michael Clark, 黃柏瑋, RISC-V ISA Dev
I hope everyone is at least familiar with PowerPC rlwimi (Rotate Left Word Immediate and Mask Insert) and Aarch64 bfm (BitField Move).

rlwimi has 15 bits of immediate (3x 5 bits for field start, end, and rotate amount) and only addresses 32 bit registers so doesn't fit well in RV.

The ARM64 bfm instruction is I think especially nice. 5 bit src and dst register fields and a 12 bit immediate. Plus two bits in the opcode to indicate whether to take the other dst bits from the existing dst contents or from zero register, and whether to sign or zero extend if inserting into zeroes (so only three of the four combinations are permitted).

The sign extend feature could be dropped, as that's easy enough to do other ways if needed.

Depending on the immediate bits, bfm either moves an arbitrary field to right-aligned, OR a right-aligned value to an arbitrary positiion. The extract operation is also not necessary to emulate, as you can extract an arbitrary field with a left shift followed by a right logical or arithmetic shift (depending on whether you want to zero or sign extend).

So the 12 bit immediate could be a simple shift amount & field size pair. Or, better, rotate amount, eliminating the need for a separate rotate instruction.

Leaving us with:

BFI - Bit Field Insert, and
BFIZ - Bit Field Insert Into Zero

Hmm .. I just talked myself out of BFIZ, as that can also be done by two shifts.

BFI dst, src, #rotate, #size, I format rot[5..0] | size[5..0] | rs1 | funct3 | rd/rs2 | opcode

The only bad thing about it is there aren't any other integer instructions that need to duplicate the dst field to rs2

At the implementation level, are start&end bit positions better than size?

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

Andrew Waterman

unread,
Feb 5, 2017, 6:04:42 AM2/5/17
to Bruce Hoult, Jacob Bachmeyer, Michael Clark, RISC-V ISA Dev, 黃柏瑋
On Sun, Feb 5, 2017 at 2:43 AM Bruce Hoult <br...@hoult.org> wrote:
I hope everyone is at least familiar with PowerPC rlwimi (Rotate Left Word Immediate and Mask Insert) and Aarch64 bfm (BitField Move).

rlwimi has 15 bits of immediate (3x 5 bits for field start, end, and rotate amount) and only addresses 32 bit registers so doesn't fit well in RV.

The ARM64 bfm instruction is I think especially nice. 5 bit src and dst register fields and a 12 bit immediate. Plus two bits in the opcode to indicate whether to take the other dst bits from the existing dst contents or from zero register, and whether to sign or zero extend if inserting into zeroes (so only three of the four combinations are permitted).

The sign extend feature could be dropped, as that's easy enough to do other ways if needed.

Depending on the immediate bits, bfm either moves an arbitrary field to right-aligned, OR a right-aligned value to an arbitrary positiion. The extract operation is also not necessary to emulate, as you can extract an arbitrary field with a left shift followed by a right logical or arithmetic shift (depending on whether you want to zero or sign extend).

So the 12 bit immediate could be a simple shift amount & field size pair. Or, better, rotate amount, eliminating the need for a separate rotate instruction.

Leaving us with:

BFI - Bit Field Insert, and
BFIZ - Bit Field Insert Into Zero

Hmm .. I just talked myself out of BFIZ, as that can also be done by two shifts.

BFI dst, src, #rotate, #size, I format rot[5..0] | size[5..0] | rs1 | funct3 | rd/rs2 | opcode

The only bad thing about it is there aren't any other integer instructions that need to duplicate the dst field to rs2

rd-as-implicit-source-operand is the main reason conditional-move isn't in the base ISA.  I advise that B instructions adhere to the base ISA template, else it will restrict uptake of the B extension.

At the implementation level, are start&end bit positions better than size?

On Sun, Feb 5, 2017 at 6:11 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
Michael Clark wrote:
- OpenSSL - ciphers and digest algorithms frequently use the rotate primitives
Rotate lacks a compiler intrinsic but can be detected by compilers - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62263

Rotate can also be implemented with SHIFT/SHIFT/OR using one scratch register.  Add a fourth opcode that clobbers the scratch register and you have a plausible candidate for macro-op fusion.

ROTLI:
   SRLI t0, source, XLEN-count
   SLLI dest, source, count
   OR dest, dest, t0
   ??? t0, ...

The trailing ??? indicates that the following instruction must clobber t0 without referencing t0, permitting the SRLI/SLLI/OR sequence to be executed with a single regfile write to dest.  This formulation allows a rotate operation to immediately follow another.

-- Jacob

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/589697F5.2010301%40gmail.com.

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

clifford

unread,
Feb 5, 2017, 6:42:01 AM2/5/17
to RISC-V ISA Dev, poweih...@gmail.com
On Saturday, February 4, 2017 at 8:56:36 PM UTC+1, corey wrote:
It's not much of an application, but as microbenchmarks go you might grab some of the examples from and see if they can be accelerated https://graphics.stanford.edu/~seander/bithacks.html

For completeness: "The Aggregate Magic Algorithms" is another collection like that:

Bruce Hoult

unread,
Feb 5, 2017, 8:12:44 AM2/5/17
to Andrew Waterman, Jacob Bachmeyer, Michael Clark, RISC-V ISA Dev, 黃柏瑋
On Sun, Feb 5, 2017 at 2:04 PM, Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
On Sun, Feb 5, 2017 at 2:43 AM Bruce Hoult <br...@hoult.org> wrote:
I hope everyone is at least familiar with PowerPC rlwimi (Rotate Left Word Immediate and Mask Insert) and Aarch64 bfm (BitField Move).

rlwimi has 15 bits of immediate (3x 5 bits for field start, end, and rotate amount) and only addresses 32 bit registers so doesn't fit well in RV.

The ARM64 bfm instruction is I think especially nice. 5 bit src and dst register fields and a 12 bit immediate. Plus two bits in the opcode to indicate whether to take the other dst bits from the existing dst contents or from zero register, and whether to sign or zero extend if inserting into zeroes (so only three of the four combinations are permitted).

The sign extend feature could be dropped, as that's easy enough to do other ways if needed.

Depending on the immediate bits, bfm either moves an arbitrary field to right-aligned, OR a right-aligned value to an arbitrary positiion. The extract operation is also not necessary to emulate, as you can extract an arbitrary field with a left shift followed by a right logical or arithmetic shift (depending on whether you want to zero or sign extend).

So the 12 bit immediate could be a simple shift amount & field size pair. Or, better, rotate amount, eliminating the need for a separate rotate instruction.

Leaving us with:

BFI - Bit Field Insert, and
BFIZ - Bit Field Insert Into Zero

Hmm .. I just talked myself out of BFIZ, as that can also be done by two shifts.

BFI dst, src, #rotate, #size, I format rot[5..0] | size[5..0] | rs1 | funct3 | rd/rs2 | opcode

The only bad thing about it is there aren't any other integer instructions that need to duplicate the dst field to rs2

rd-as-implicit-source-operand is the main reason conditional-move isn't in the base ISA.  I advise that B instructions adhere to the base ISA template, else it will restrict uptake of the B extension

If you can have neither implicit rs2, nor instruction encoding (and read ports) for three source operands (to read an already-prepared bit merge mask from a 3rd register) then you can't do better than an OR for final assembly of the destination with the shifted and masked source field being inserted.

So that leaves such possibilities as helper instructions to:

1) zero out a field in the destination, if needed (it often won't be). This is basically a BIC/ANDNOT of a mask, or AND of the inverse of a mask. (I note in passing that DEC machines such as the PDP11 and VAX had a BIC instruction but didn't have AND).

For RV64I I can't see any way to do this in fewer than five instructions (unless you load a mask from a literal pool, which I assume is undesirable). There are quite a few ways to do it in five, but I think no ways to do it in four without either a rotate or the proposed "shift in 1s" instruction, or BIC.

"Shift in 1s" and a rotate together would permit a three instruction sequence:

slmi t0, zero, #xlen-fieldSize
rli t0, t0, #xlen-fieldStart m  # or rri t0, t0, #fieldStart
and dst, dst, t0

rli would fit in nicely with slli, perhaps using the same bit 30 that distinguishes srli/srai and add/sub.

But then the same would apply to encoding slmi :( They can't both have it.

A single instruction "Clear field" would be much better.

I note that Jacob's MASK-FIELD C code and asm code differ -- the C did a ~ on the mask, but the asm didn't.


2) mask and shift the field to be inserted. This is already only a simple left shift if the field contents are unsigned, or a left shift followed by a logical right shift if the thing to be inserted is signed. Both shifts are 16 bit instructions if you can destroy the source register, so there isn't much room to improve here. A "Shift and mask" instruction would always be a single 32 bit instruction, but preserving the source register would be the only improvement, and it's unneeded if the source is unsigned.

Rex McCrary

unread,
Feb 5, 2017, 3:44:53 PM2/5/17
to Bruce Hoult, Andrew Waterman, Jacob Bachmeyer, Michael Clark, RISC-V ISA Dev, 黃柏瑋
Our Bit Manipulation Meetings are as follows:

If you are a member of the task group you should have already received this notification.  If not, please join the Foundation and the bit manipulation task group.  If you are in the process of joining and that process is just not complete, let me know and I will try to make sure you have what you need for tomorrow.  I am unable to guarantee it, but I will try..

Rex McCrary

AMD

Bit Manipulation Task Group Chair


--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.

To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

Jacob Bachmeyer

unread,
Feb 5, 2017, 9:00:55 PM2/5/17
to Bruce Hoult, Andrew Waterman, Michael Clark, RISC-V ISA Dev, 黃柏瑋
Bruce Hoult wrote:
> So that leaves such possibilities as helper instructions to:
>
> 1) zero out a field in the destination, if needed (it often won't be).
> This is basically a BIC/ANDNOT of a mask, or AND of the inverse of a
> mask. (I note in passing that DEC machines such as the PDP11 and VAX
> had a BIC instruction but didn't have AND).
>
> For RV64I I can't see any way to do this in fewer than five
> instructions (unless you load a mask from a literal pool, which I
> assume is undesirable). There are quite a few ways to do it in five,
> but I think no ways to do it in four without either a rotate or the
> proposed "shift in 1s" instruction, or BIC.
>
> "Shift in 1s" and a rotate together would permit a three instruction
> sequence:
>
> slmi t0, zero, #xlen-fieldSize
> rli t0, t0, #xlen-fieldStart m # or rri t0, t0, #fieldStart
> and dst, dst, t0
>
> rli would fit in nicely with slli, perhaps using the same bit 30 that
> distinguishes srli/srai and add/sub.
>
> But then the same would apply to encoding slmi :( They can't both have it.

I see a solution here. I proposed using bit 29 to select
"shift-in-ones", but bit 30 and bit 29 both set would suggest
"arithmetic shift mask" which is nonsensical (and undefined for a left
shift). At a slight increase in decode complexity, the combination of
bit 30 and bit 29 could represent "rotate". The minor opcode in funct3
already determines left/right. The major opcode already determines
whether to use an immediate or an rs2. So adding "shift-in-ones" really
adds SLM/SRM/SLMI/SRMI.

My previous statement that there is a large amount of space in funct7 on
shifts was incorrect--RV64I adds bit 25 to the shift amount and RV128I
can be expected to add bit 26 to the shift amount. That leaves bits 27,
28 and 31 for further "shift-like" operations, but bit 31 has a high
fanout. Using bits 30:29==2'b01 to select "shift-in-ones" and bits
30:29==2'b11 to select "rotate" would make efficient use of this
encoding space, leaving three bits for future extensions.

> A single instruction "Clear field" would be much better.

That runs into the same encoding difficulties as EXTRACT-FIELD,
requiring a combined offset/length control word in a register.

> I note that Jacob's MASK-FIELD C code and asm code differ -- the C did
> a ~ on the mask, but the asm didn't.

That was an error in my asm code. There should have been a "NOT t0, t0"
just before the AND in the MASK-FIELD example. MASK-FIELD was really a
helper operation for WRITE-FIELD and I did not give the same attention
to proofreading it as the other two. My mistake.


-- Jacob

Bruce Hoult

unread,
Feb 6, 2017, 5:52:21 PM2/6/17
to Jacob Bachmeyer, Andrew Waterman, Michael Clark, RISC-V ISA Dev, 黃柏瑋
On Mon, Feb 6, 2017 at 5:00 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
Bruce Hoult wrote:
So that leaves such possibilities as helper instructions to:

1) zero out a field in the destination, if needed (it often won't be). This is basically a BIC/ANDNOT of a mask, or AND of the inverse of a mask. (I note in passing that DEC machines such as the PDP11 and VAX had a BIC instruction but didn't have AND).

For RV64I I can't see any way to do this in fewer than five instructions (unless you load a mask from a literal pool, which I assume is undesirable). There are quite a few ways to do it in five, but I think no ways to do it in four without either a rotate or the proposed "shift in 1s" instruction, or BIC.

"Shift in 1s" and a rotate together would permit a three instruction sequence:

slmi t0, zero, #xlen-fieldSize
rli t0, t0, #xlen-fieldStart m  # or rri t0, t0, #fieldStart
and dst, dst, t0

rli would fit in nicely with slli, perhaps using the same bit 30 that distinguishes srli/srai and add/sub.

But then the same would apply to encoding slmi :( They can't both have it.

I see a solution here.  I proposed using bit 29 to select "shift-in-ones", but bit 30 and bit 29 both set would suggest "arithmetic shift mask" which is nonsensical (and undefined for a left shift).  At a slight increase in decode complexity, the combination of bit 30 and bit 29 could represent "rotate".  The minor opcode in funct3 already determines left/right.  The major opcode already determines whether to use an immediate or an rs2.  So adding "shift-in-ones" really adds SLM/SRM/SLMI/SRMI.

My previous statement that there is a large amount of space in funct7 on shifts was incorrect--RV64I adds bit 25 to the shift amount and RV128I can be expected to add bit 26 to the shift amount.  That leaves bits 27, 28 and 31 for further "shift-like" operations, but bit 31 has a high fanout.  Using bits 30:29==2'b01 to select "shift-in-ones" and bits 30:29==2'b11 to select "rotate" would make efficient use of this encoding space, leaving three bits for future extensions.

A single instruction "Clear field" would be much better.

That runs into the same encoding difficulties as EXTRACT-FIELD, requiring a combined offset/length control word in a register.

 rs1, rd, and offset/length in the 12 bit immediate. I format.

Perfect for RV32 & RV64. RV128 can deal with up to a 64 bit field with a preshift (which could be macro op fused).

You could have an R format version with offset/length in a register as you suggest, but I think the dynamic case can be discounted as very rare. PowerPC and Aarch64 don't provide for a dynamic case.


I note that Jacob's MASK-FIELD C code and asm code differ -- the C did a ~ on the mask, but the asm didn't.

That was an error in my asm code.  There should have been a "NOT t0, t0" just before the AND in the MASK-FIELD example.  MASK-FIELD was really a helper operation for WRITE-FIELD and I did not give the same attention to proofreading it as the other two.  My mistake.

 

I was looking at all the bit manipulation instructions Intel and/or AMD have added over the years:


POPCNT and LZCNT are essential and uncontroversial. I can't imagine any proposal omitting them.

PDEP and PEXT do the same kind of thing we're talking about here, but more flexibly. The hardware complexity must be very high, with the shift amount of each bit depending on how many bits to its right are set! You could not do them on RV as an immediate instruction as you need a full width immediate. Maybe they'd be good for a dynamic version. (they've already been discussed by others in this thread)

I was interested to see AMD added ANDN (which I usually call BIC, as I started on PDP11/VAX). I've always liked this instruction, and it would eliminate the NOT you forget in your assembly code :-) I'd like to see ORN as well.

I'd like to see a rotate available in some form. It can be a special case of EXTRACT-FIELD, with xLen field size, if EXTRACT-FIELD rotates rather than shifts.

I'm not impressed by the other instructions in BMI1, BMI2, and TBM. They're too easily efficiently replaced by the given simple C expressions -- no more than two instructions if you have ORN and ANDN.

Andrew Waterman

unread,
Feb 6, 2017, 5:58:07 PM2/6/17
to Bruce Hoult, Jacob Bachmeyer, Michael Clark, RISC-V ISA Dev, 黃柏瑋
Tommy Thorn's fine minimalist solution omits CLZ by using POPC to
implement CTZ and BREV to implement CLZ in terms of that.

>
> PDEP and PEXT do the same kind of thing we're talking about here, but more
> flexibly. The hardware complexity must be very high, with the shift amount
> of each bit depending on how many bits to its right are set! You could not
> do them on RV as an immediate instruction as you need a full width
> immediate. Maybe they'd be good for a dynamic version. (they've already been
> discussed by others in this thread)
>
> I was interested to see AMD added ANDN (which I usually call BIC, as I
> started on PDP11/VAX). I've always liked this instruction, and it would
> eliminate the NOT you forget in your assembly code :-) I'd like to see ORN
> as well.
>
> I'd like to see a rotate available in some form. It can be a special case of
> EXTRACT-FIELD, with xLen field size, if EXTRACT-FIELD rotates rather than
> shifts.
>
> I'm not impressed by the other instructions in BMI1, BMI2, and TBM. They're
> too easily efficiently replaced by the given simple C expressions -- no more
> than two instructions if you have ORN and ANDN.
>
> --
> You received this message because you are subscribed to the Google Groups
> "RISC-V ISA Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to isa-dev+u...@groups.riscv.org.
> To post to this group, send email to isa...@groups.riscv.org.
> Visit this group at
> https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
> To view this discussion on the web visit
> https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAMU%2BEkzea25Zz0KSY%2B6fpKAU8Nr%2B5%3D5AnL9fkRvq2BaUc8cFFw%40mail.gmail.com.

Bruce Hoult

unread,
Feb 6, 2017, 6:17:11 PM2/6/17
to Andrew Waterman, Jacob Bachmeyer, Michael Clark, RISC-V ISA Dev, 黃柏瑋
I haven't seen that, but sure, popcount(x ^ (x-1)) counts trailing zeros. That works, but it makes CLZ a four instruction sequence. Of course that's a lot better than you can do with RVI.
  

Fabjan Sukalia

unread,
Feb 6, 2017, 6:18:01 PM2/6/17
to Bruce Hoult, Jacob Bachmeyer, Andrew Waterman, Michael Clark, RISC-V ISA Dev, 黃柏瑋

2017-02-06 23:52 GMT+01:00 Bruce Hoult <br...@hoult.org>:
I was interested to see AMD added ANDN (which I usually call BIC, as I started on PDP11/VAX). I've always liked this instruction, and it would eliminate the NOT you forget in your assembly code :-) I'd like to see ORN as well.

I agree with ANDN/BIC. ANDN would be helpful for microcontrollers because you can easily clear bits in a register which is common. Maybe also ANDNI so it is easier to clear the lower bits. 

ANDN can be easily made with AND and NOT (XORI with -1 immediate) but surprisingly RVC doesn't have a C.NOT or C.XORI (or did I miss something?).

--
Fabjan Sukalia


--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.

To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

Jacob Bachmeyer

unread,
Feb 6, 2017, 9:06:15 PM2/6/17
to Bruce Hoult, Andrew Waterman, Michael Clark, RISC-V ISA Dev, 黃柏瑋
I see the dynamic case as exactly what makes bit field access
instructions useful--they can be synthesized using RVI, but most of the
steps are forming the mask. In the static case, the compiler can do
that and generate LI to load a mask calculated in advance. In the
dynamic case, load-1/shift-left/subtract-1/shift-left all have to be
done. The "shift-in-ones" instruction I propose replaces the
load-1/shift-left/subtract-1 sequence and also allows two more
single-instruction expansions for LI: If the immediate is a contiguous
group of ones on either side, use SLMI/SRMI from x0.

While the load-1/shift-left/subtract-1 sequence could be a candidate for
macro-op fusion, the circuitry required to implement that is far more
complex than the circuitry required to implement SLM/SRM/SLMI/SRMI. The
former would require the instruction decoder to recognize a
three-instruction sequence, while the latter requires only a wire from
the decoder to the ALU in the simplest implementations, one extra bit in
certain pipeline latches in a pipelined implementation, and analogous
costs that are trivial in comparison to the inherent complexity of
out-of-order implementations.

> I note that Jacob's MASK-FIELD C code and asm code differ --
> the C did a ~ on the mask, but the asm didn't.
>
>
> That was an error in my asm code. There should have been a "NOT
> t0, t0" just before the AND in the MASK-FIELD example. MASK-FIELD
> was really a helper operation for WRITE-FIELD and I did not give
> the same attention to proofreading it as the other two. My mistake.
>
>
> I was looking at all the bit manipulation instructions Intel and/or
> AMD have added over the years:
>
> https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets

Most of these seem to be good candidates for macro-op fusion instead of
distinct instructions. Keeping B small will help its adoption and I
would like to see "GCB" as the common implementation.

> POPCNT and LZCNT are essential and uncontroversial. I can't imagine
> any proposal omitting them.

I understand that POPCOUNT was in the original "B" proposal; no
objection. LZCNT is probably also important, but I would ask why it was
omitted from the original proposal.

> PDEP and PEXT do the same kind of thing we're talking about here, but
> more flexibly. The hardware complexity must be very high, with the
> shift amount of each bit depending on how many bits to its right are
> set! You could not do them on RV as an immediate instruction as you
> need a full width immediate. Maybe they'd be good for a dynamic
> version. (they've already been discussed by others in this thread)

The amazingly high hardware complexity is the main problem I have with
these operations. They also seem to be new-ish and I will be concerned
about possible patents until I am proven wrong about the novelty of
these instructions.

> I was interested to see AMD added ANDN (which I usually call BIC, as I
> started on PDP11/VAX). I've always liked this instruction, and it
> would eliminate the NOT you forget in your assembly code :-) I'd like
> to see ORN as well.

AND-NOT and OR-NOT could be macro-op fusion pairs, or even a general
AND-XOR and OR-XOR as macro-op fusion pairs, since NOT in RISC-V is
actually XORI. Is anyone keeping a list of possible macro-op fusion
candidates to ensure that GCC code generation takes advantage of these
opportunities as soon as implementations do?

The advantage of actual AND-NOT and OR-NOT opcodes would be a slight
reduction in register pressure, if both inputs remain live. The catch
is that the ALU circuits required to implement AND-NOT and OR-NOT are
essentially the same as implementing the AND-XOR and OR-XOR macro-op
fusion pairs and this probably adds an XOR gate delay to the critical
path for bitwise operations. While that may not be the ALU critical
path, I am reluctant to support adding this burden to all implementations.

> I'd like to see a rotate available in some form. It can be a special
> case of EXTRACT-FIELD, with xLen field size, if EXTRACT-FIELD rotates
> rather than shifts.

I propose explicit rotates, using the otherwise forbidden encoding of
bits 30:29==2'b11. Another option could be to offer only left-rotate,
using bit 30 to make (nonsensical) "left arithmetic shift" decode as
"left rotate". I think that adding the "shift-mask" instructions and
using (nonsensical) "arithmetic shift mask" to encode "rotate" is a
cleaner and more symmetrical option.

> I'm not impressed by the other instructions in BMI1, BMI2, and TBM.
> They're too easily efficiently replaced by the given simple C
> expressions -- no more than two instructions if you have ORN and ANDN.

Well, EXTRACT-FIELD (BEXTR) is already being discussed, and I think that
I gave a good argument for implementing SLM instead, since it simplifies
EXTRACT-FIELD to three instructions in both static and dynamic cases
with a very small implementation burden.

BLSI, BLSMSK, BLSR are all 2-instruction sequences that are candidates
for macro-op fusion if the source and destination registers differ.

BLSI:
NEG dest, source
AND dest, dest, source
BLSMSK:
SUBI dest, source, 1
XOR dest, dest, source
BLSR:
SUBI dest, source, 1
AND dest, dest, source

TZCNT seems to be symmetric to LZCNT, and I suspect that the original B
proposal omitted it for the same reason.

BZHI is a special case of CLEAR-FIELD, which is a special case of
WRITE-FIELD. I see no reason to want this.

MULX, RORX, SARX, SHRX, SHLX are all special cases of other x86
instructions that do not update the condition code register, a concept
that RISC-V intentionally omits entirely, so these are irrelevant.

PARALLEL-DEPOSIT (PDEP) and PARALLEL-EXTRACT (PEXT) are also already
being discussed.

BLCFILL, BLCMSK, BLCS, BLSFILL are all 2-instruction sequences that are
candidates for macro-op fusion if the source and destination registers
differ.

BLCFILL:
ADDI dest, source, 1
AND dest, dest, source
BLCMSK:
ADDI dest, source, 1
XOR dest, dest, source
BLCS:
ADDI dest, source, 1
OR dest, dest, source
BLSFILL:
SUBI dest, source, 1
OR dest, dest, source

BLCI, BLCIC, BLSIC, T1MSKC, TZMSK are all 2-instruction sequences with
AND-NOT and OR-NOT, and 3-instruction sequences otherwise. Provided
that source and destination registers differ, BLCI is a candidate for
macro-op fusion. The others are candidates for macro-op fusion if
AND-NOT and OR-NOT are explicit instructions, otherwise they require a
scratch register.

# assuming that AND-NOT and OR-NOT are formed using macro-op fusion
BLCI:
ADDI dest, source, 1
NOT dest, dest
OR dest, dest, source
BLCIC:
ADDI t0, source, 1
NOT dest, source
AND dest, dest, t0
BLSIC:
SUBI t0, source, 1
NOT dest, source
OR dest, dest, t0
T1MSKC:
ADDI t0, source, 1
NOT dest, source
OR dest, dest, t0
TZMSK:
SUBI t0, source, 1
NOT dest, source
AND dest, dest, t0

# assuming explicit AND-NOT and OR-NOT that invert the second input value
BLCI:
ADDI dest, source, 1
ORN dest, source, dest
BLCIC:
ADDI dest, source, 1
ANDN dest, dest, source
BLSIC:
SUBI dest, source, 1
ORN dest, dest, source
T1MSKC:
ADDI dest, source, 1
ORN dest, dest, source
TZMSK:
SUBI dest, source, 1
ANDN dest, dest, source


Having just written the above (as proof of my assertions that these are
simple operations), I wonder if generic rules for synthesizing (by a
compiler) and recognizing (by hardware) macro-op fusion sequences may be
a more fruitful approach than trying to build a list? Especially if the
hardware can recognize a sequence that uses a scratch register that is
immediately clobbered by the following instruction, which may itself be
part of a following fusion sequence. This could lead to advanced
implementations with complex dynamic data paths that build large
macro-ops spanning a half-dozen or so instructions, while still allowing
the same code to run on simple implementations.

-- Jacob

Bruce Hoult

unread,
Feb 7, 2017, 4:25:25 AM2/7/17
to Jacob Bachmeyer, Andrew Waterman, Michael Clark, RISC-V ISA Dev, 黃柏瑋
I don't really agree with this about the static case.

Sure, masks of 11 bits or fewer can come directly from a LI and moved anywhere in a 32/64/128 bit register with a shift. masks up to 32 bits can be made with a LUI/ADDI (plus a shift for 64/128).

So you're up to 2 or 3 instructions already and what you want most is the negative mask for clearing the field in the destination if you don't have ANDN, so that needs either another NOT or your "shift in 1s".

The instruction count is rising rapidly, compared to a "clear-field" single instruction.

Note that arbitrarily sized and positioned masks can be made in three instructions, even in RV128 with NOT ZERO, followed by a shift one way to shorten it, and then a shift the other way to position it. That works both static and dynamic.

But then you still need another NOT and AND to clear the old contents from the field.

If you have rotate then the NOT can be eliminated by changing the mask generation to NOT ZERO, shift (shortening), rotate (positioning).

 
The "shift-in-ones" instruction I propose replaces the load-1/shift-left/subtract-1 sequence and also allows two more single-instruction expansions for LI:  If the immediate is a contiguous group of ones on either side, use SLMI/SRMI from x0.

I do like the "shift-in-ones" instruction. I just don't think it's enough.


Perhaps there needs to be some general guideline of how much instruction sequences need to be shortened by in order to justify a new instruction?

POPC, CLZ, PDEP, PEXT all replace very long sequences (or even loops). It seems clear that if those operations are sufficiently common to care about at all then either a single instruction or helper instructions are justified. (I'm not convinced the added generality of PDEP & PEXT vs field-extract and field-insert would be used)

ANDN and ORN are quite trivial and don't shorten programs very much by themselves, but they do open up a lot more possibilities for subsequent 2-instruction fusing.

Maybe I'm wrong, but my intuition says that recognising and fusing 3 or 4 or 5 instruction sequences is unlikely to be practical.


We're restricting ourselves from a single-instruction bit-field-move or rotate-and-mask-insert because those require either three register operands (or two and a large immediate), or else the destination register being an implicit source register (as both PPC and Aarch64 do). Implemented in RVI this is eight instructions. We're not allowing ourselves to get it down to 1, but it can be three using suitable helpers (clear-field and mask-and-shift, followed by OR). Or four using only clear-field (three if the value to be inserted is unsigned and so doesn't need masking).

"shift-in-ones" by itself only reduces eight instructions to seven.

Adding either rotate or ANDN reduces it to six -- as also does rotate and ANDN but without shift-in-ones.

Jacob Bachmeyer

unread,
Feb 7, 2017, 7:30:52 PM2/7/17
to Bruce Hoult, Andrew Waterman, Michael Clark, RISC-V ISA Dev, 黃柏瑋
Bruce Hoult wrote:
> On Tue, Feb 7, 2017 at 5:06 AM, Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
> Bruce Hoult wrote:
>
> On Mon, Feb 6, 2017 at 5:00 AM, Jacob Bachmeyer
> <jcb6...@gmail.com <mailto:jcb6...@gmail.com>
The compiler can also choose to load the mask from a constant pool--one
LOAD instruction for any mask. This is not an option in the dynamic case.

> Note that arbitrarily sized and positioned masks can be made in three
> instructions, even in RV128 with NOT ZERO, followed by a shift one way
> to shorten it, and then a shift the other way to position it. That
> works both static and dynamic.
>
> But then you still need another NOT and AND to clear the old contents
> from the field.
>
> If you have rotate then the NOT can be eliminated by changing the mask
> generation to NOT ZERO, shift (shortening), rotate (positioning).

For the special case of generating a negative mask, "shift-in-1s" can
replace rotate for positioning: NOT ZERO, shift-in-0 (lengthening),
shift-in-1 (positioning).

>
> The "shift-in-ones" instruction I propose replaces the
> load-1/shift-left/subtract-1 sequence and also allows two more
> single-instruction expansions for LI: If the immediate is a
> contiguous group of ones on either side, use SLMI/SRMI from x0.
>
>
> I do like the "shift-in-ones" instruction. I just don't think it's enough.
>
> Perhaps there needs to be some general guideline of how much
> instruction sequences need to be shortened by in order to justify a
> new instruction?

I think of it as "benefits (shortened programs) versus costs
(implementation complexity)". The "shift-in-1s" instructions have
almost trivial incremental implementation complexity and the proposed
encoding for "shift-in-1s" also leads (by creating an invalid case) to
an encoding for rotate operations. I proposed it as a minimal-cost
alternative to complex field-access instructions that is generic enough
to have other uses. I expect that the hardware cost to implement
"shift-in-1s" is less than the cost to implement "rotate".

> POPC, CLZ, PDEP, PEXT all replace very long sequences (or even loops).
> It seems clear that if those operations are sufficiently common to
> care about at all then either a single instruction or helper
> instructions are justified. (I'm not convinced the added generality of
> PDEP & PEXT vs field-extract and field-insert would be used)

POPC and CLZ are also relatively simple; apparently the original "B"
proposal derived CLZ from BREV+CTZ and CTZ from POPC. I am not quite
sure how and my efforts to find this proposal have come up empty. PDEP
and PEXT are anything but simple in an efficient implementation and I
think that they are probably best omitted due to their enormous complexity.

> ANDN and ORN are quite trivial and don't shorten programs very much by
> themselves, but they do open up a lot more possibilities for
> subsequent 2-instruction fusing.

While these can eliminate scratch registers from some sequences and the
encoding atop the existing AND and OR could be very simple (bit
30==invert rs2 if set; bit 29==invert rs1 if set; invert both inputs if
both bits are set), this effectively adds an XOR gate delay to the ALU
path for bitwise operations. If only a single input can be inverted, it
might be possible to reuse the ALU's primary XOR function for this,
which would also allow AND-XOR and OR-XOR fusion pairs, but I think that
this would still mean that the ALU must run logical operations through
the chain XOR/{AND/OR/...}/MUX, while logical operations could otherwise
use {AND/OR/XOR/...}/MUX where the final MUX is the ALU output
multiplexer. Then again, clever ALU design may be able to mitigate this
cost or even turn it to an advantage. I admit that I do not know.

> Maybe I'm wrong, but my intuition says that recognising and fusing 3
> or 4 or 5 instruction sequences is unlikely to be practical.

For very high performance multi-pipeline out-of-order processors, it
could be a reasonable incremental increase in complexity. Conceptually,
it is similar to parallel decoding of variable-length instructions,
which current high-performance x86 processors do.

> We're restricting ourselves from a single-instruction bit-field-move
> or rotate-and-mask-insert because those require either three register
> operands (or two and a large immediate), or else the destination
> register being an implicit source register (as both PPC and Aarch64
> do). Implemented in RVI this is eight instructions. We're not allowing
> ourselves to get it down to 1, but it can be three using suitable
> helpers (clear-field and mask-and-shift, followed by OR). Or four
> using only clear-field (three if the value to be inserted is unsigned
> and so doesn't need masking).

I think that I called these MASK-FIELD and EXTRACT-FIELD, and agree that
they would be useful and are probably about as good as RISC-V can get in
this direction, but I consider the need for a combined field descriptor
as a bit of a wart. (Which I proposed with an odd layout for a reason:
"ADDI dest, x0, <control word> / EXTRACT-FIELD dest, source, dest" makes
a nice fusion pair.) Overall, I expect that EXTRACT-FIELD is probably
more valuable than MASK-FIELD, because it is useful for reading
bit-packed data and I suspect that decompression is more frequent than
compression. That said, once we have paid the weirdness cost for
EXTRACT-FIELD, MASK-FIELD is very little more.


-- Jacob

Bruce Hoult

unread,
Feb 7, 2017, 7:57:47 PM2/7/17
to Jacob Bachmeyer, Andrew Waterman, Michael Clark, RISC-V ISA Dev, 黃柏瑋
On Wed, Feb 8, 2017 at 3:30 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
I don't really agree with this about the static case.

Sure, masks of 11 bits or fewer can come directly from a LI and moved anywhere in a 32/64/128 bit register with a shift. masks up to 32 bits can be made with a LUI/ADDI (plus a shift for 64/128).

So you're up to 2 or 3 instructions already and what you want most is the negative mask for clearing the field in the destination if you don't have ANDN, so that needs either another NOT or your "shift in 1s".

The instruction count is rising rapidly, compared to a "clear-field" single instruction.

The compiler can also choose to load the mask from a constant pool--one LOAD instruction for any mask.  This is not an option in the dynamic case.

Yes, but many or most implementations will have 2 or 3 cycle delays loading from L1 cache. making that more expensive than a sequence of non-memory instructions.

The constant pool approach worked great on cache-less ARM7TDMIs, not so well recently...
 

Perhaps there needs to be some general guideline of how much instruction sequences need to be shortened by in order to justify a new instruction?

I think of it as "benefits (shortened programs) versus costs (implementation complexity)".  The "shift-in-1s" instructions have almost trivial incremental implementation complexity and the proposed encoding for "shift-in-1s" also leads (by creating an invalid case) to an encoding for rotate operations.  I proposed it as a minimal-cost alternative to complex field-access instructions that is generic enough to have other uses.  I expect that the hardware cost to implement "shift-in-1s" is less than the cost to implement "rotate".

Sure. Though if you already have arbitrary shift then rotate is just adding wires, right? Albeit wires that cross over the other wires. I don't know how hard that really is at the chip level.
 
POPC, CLZ, PDEP, PEXT all replace very long sequences (or even loops). It seems clear that if those operations are sufficiently common to care about at all then either a single instruction or helper instructions are justified. (I'm not convinced the added generality of PDEP & PEXT vs field-extract and field-insert would be used)

POPC and CLZ are also relatively simple; apparently the original "B" proposal derived CLZ from BREV+CTZ and CTZ from POPC.  I am not quite sure how and my efforts to find this proposal have come up empty.

It's very simple. and I worked it out and explained it in a previous post.

n ^ (n-1) gives you 1s in the LSBs up to and including the smallest set bit. because it changes that smallest set bit from 1 to 0 and all lower bits from 0 to 1. Thus all those give 1 when XORed with the original, while the higher unchanged bits give 0s when XORed with themselves.

Then a POPC gives you one more than the number of trailing zeroes.

Reversing the bits and then CTZ is obviously the same as CLZ.

 
  PDEP and PEXT are anything but simple in an efficient implementation and I think that they are probably best omitted due to their enormous complexity.

Agree.

 

ANDN and ORN are quite trivial and don't shorten programs very much by themselves, but they do open up a lot more possibilities for subsequent 2-instruction fusing.

While these can eliminate scratch registers from some sequences and the encoding atop the existing AND and OR could be very simple (bit 30==invert rs2 if set; bit 29==invert rs1 if set; invert both inputs if both bits are set), this effectively adds an XOR gate delay to the ALU path for bitwise operations.

Yes, but that's still far short of the gate delays of ADD, which has to be able to propagate a carry all the way from bit 0 to bit 31, 63, or 127.

Inverting both inputs of AND of course gives you NOR, and inverting both inputs of OR gives you NAND. Which could be useful. No need to have an option to invert the outputs.

Inverting both inputs of XOR is useless. Inverting one input is the same as inverting the output.

Jacob Bachmeyer

unread,
Feb 7, 2017, 9:20:17 PM2/7/17
to Bruce Hoult, Andrew Waterman, Michael Clark, RISC-V ISA Dev, 黃柏瑋
Bruce Hoult wrote:
> On Wed, Feb 8, 2017 at 3:30 AM, Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
> I don't really agree with this about the static case.
>
> Sure, masks of 11 bits or fewer can come directly from a LI
> and moved anywhere in a 32/64/128 bit register with a shift.
> masks up to 32 bits can be made with a LUI/ADDI (plus a shift
> for 64/128).
>
> So you're up to 2 or 3 instructions already and what you want
> most is the negative mask for clearing the field in the
> destination if you don't have ANDN, so that needs either
> another NOT or your "shift in 1s".
>
> The instruction count is rising rapidly, compared to a
> "clear-field" single instruction.
>
>
> The compiler can also choose to load the mask from a constant
> pool--one LOAD instruction for any mask. This is not an option in
> the dynamic case.
>
>
> Yes, but many or most implementations will have 2 or 3 cycle delays
> loading from L1 cache. making that more expensive than a sequence of
> non-memory instructions.

In a simple implementation, yes, but an implementation that can execute
other instructions while a LOAD is pending can probably hide that
latency if the LOAD is scheduled 4 or 5 instructions before the mask is
needed. RISC-V has enough registers that this should not produce
excessive register pressure.

> The constant pool approach worked great on cache-less ARM7TDMIs, not
> so well recently...

This could be a problem. As I understand, constant pools are the
preferred means (the other means being LUI/ADDI/SHIFT/LUI/ADDI/OR) to
load general 64-bit values in RV64...

> Perhaps there needs to be some general guideline of how much
> instruction sequences need to be shortened by in order to
> justify a new instruction?
>
>
> I think of it as "benefits (shortened programs) versus costs
> (implementation complexity)". The "shift-in-1s" instructions have
> almost trivial incremental implementation complexity and the
> proposed encoding for "shift-in-1s" also leads (by creating an
> invalid case) to an encoding for rotate operations. I proposed it
> as a minimal-cost alternative to complex field-access instructions
> that is generic enough to have other uses. I expect that the
> hardware cost to implement "shift-in-1s" is less than the cost to
> implement "rotate".
>
>
> Sure. Though if you already have arbitrary shift then rotate is just
> adding wires, right? Albeit wires that cross over the other wires. I
> don't know how hard that really is at the chip level.

And multiplexers... and the control logic to drive the multiplexers...
Although I think that at least some single-cycle arbitrary shifter
topologies already pay most of this price to be able to do both left and
right shifts, so only the control logic would be affected by adding
rotate, at least for those. But those are pretty big, and simpler
implementations may have different trade-offs. Put another way: in
big, high-performance implementations, rotate is relatively cheap, in
tiny, minimal-area implementations, it could be more expensive. Then
again, the tiny implementations could simply emulate rotate with a
monitor trap (or even an "invisible" monitor trap to an internal ROM
using a few shadow registers--emulating rotate is only
ADDI/SHIFT/SHIFT/OR with one scratchpad).

> POPC, CLZ, PDEP, PEXT all replace very long sequences (or even
> loops). It seems clear that if those operations are
> sufficiently common to care about at all then either a single
> instruction or helper instructions are justified. (I'm not
> convinced the added generality of PDEP & PEXT vs field-extract
> and field-insert would be used)
>
>
> POPC and CLZ are also relatively simple; apparently the original
> "B" proposal derived CLZ from BREV+CTZ and CTZ from POPC. I am
> not quite sure how and my efforts to find this proposal have come
> up empty.
>
>
> It's very simple. and I worked it out and explained it in a previous post.
>
> n ^ (n-1) gives you 1s in the LSBs up to and including the smallest
> set bit. because it changes that smallest set bit from 1 to 0 and all
> lower bits from 0 to 1. Thus all those give 1 when XORed with the
> original, while the higher unchanged bits give 0s when XORed with
> themselves.
>
> Then a POPC gives you one more than the number of trailing zeroes.
>
> Reversing the bits and then CTZ is obviously the same as CLZ.

So CTZ is the BLSMSK fusion pair, followed by POPC. Prefixing the whole
sequence with BREV converts it to a 4-instruction CLZ, but only CTZ
could be a macro-op fusion candidate, since BREV produces a temporary
that must remain live until just before POPC.

> ANDN and ORN are quite trivial and don't shorten programs very
> much by themselves, but they do open up a lot more
> possibilities for subsequent 2-instruction fusing.
>
>
> While these can eliminate scratch registers from some sequences
> and the encoding atop the existing AND and OR could be very simple
> (bit 30==invert rs2 if set; bit 29==invert rs1 if set; invert both
> inputs if both bits are set), this effectively adds an XOR gate
> delay to the ALU path for bitwise operations.
>
>
> Yes, but that's still far short of the gate delays of ADD, which has
> to be able to propagate a carry all the way from bit 0 to bit 31, 63,
> or 127.

The carry does not have to propagate directly; look up carry look-ahead
adders; Wikipedia has an article explaining the Kogge-Stone adder as one
example.

> Inverting both inputs of AND of course gives you NOR, and inverting
> both inputs of OR gives you NAND. Which could be useful. No need to
> have an option to invert the outputs.
>
> Inverting both inputs of XOR is useless. Inverting one input is the
> same as inverting the output.

OR and AND are in a neat corner of the encoding space, one bit away from
SLT/SLTU. For ADD, bit 30 applies 2's-complement to rs2. For AND-NOT
and OR-NOT, bit 30 could apply 1's-complement to rs2 and bit 29 could
apply 1's-complement to rs1. Bit 12 nicely distinguishes these
cases--it is set for both AND and OR, but cleared for ADD and SHIFT.
Bit 12 is clear in XOR. Bit 29 could be used to apply 1's-complement to
rs1 to produce XNOR, but I am unsure what uses an explicit XNOR would
have. Bit 30 could be used to apply 2's-complement to rs2 with XOR, but
I am unsure of the use of this and suspect that it may impose excessive
constraints on ALU topology.


-- Jacob

Andrew Waterman

unread,
Feb 7, 2017, 9:34:36 PM2/7/17
to Jacob Bachmeyer, Bruce Hoult, Michael Clark, RISC-V ISA Dev, 黃柏瑋
Loads tend to be >25% of dynamic instructions, so it is a safe bet
that implementations will go to great lengths to execute loads quickly
in the common case. I wouldn't ring the death knell of the constant
pool just yet!

Allen Baum

unread,
Feb 7, 2017, 10:40:00 PM2/7/17
to Bruce Hoult, Jacob Bachmeyer, Andrew Waterman, Michael Clark, RISC-V ISA Dev, 黃柏瑋
A bunch of implementation comments; 
My abject apologies in advance if these are obvious (Iwhich will be the case for many of you) or they've already been discussed, but I missed the first couple of meetings and its tough keeping up with the volume of email,:

Rotate doesn't really need to add wire area.- or even transistor area, depending on how the shift is implemented. In a full barrel shift, you have a triangular array. 
If you just fill in the unused half of the triangle, you get a rotater in a rectangular array pretty much for free (my assumption here is that the other diagonal half of the shifter would likely be white-space, and not terribly useful in a compact datapath layout).

 I think the same is true if you have a logN mux style shifter.
Note that the layout almost certainly have the two primary inputs bit interleaved, not one to the right or left of the other - but you probably know that.

So, by selecting/ preconditioning the inputs, the already existing shifter gives you all the functionality you need for basic field ops (but not bit gather/scatter/rotate/parallel extrac-deposit/permute, clz, etc). 

If you clear the inputs to one half or the other and it can then effectively shift in either direction. 
If you put different register sources into each half, then it can perform a double shift (i.e. extract a fullword output that crosses register boundaries, pretty useful for aligning operands in unfriendly code)

Adding masking logic to the output (mask generation is performed in parallel, so its an additional AND gate delay) lets you do arbitrary field extract and right justifty. And the masking logic will let you generate masks as well.

Field insert is a bit more complicated, but doesn't take much more area or delay

IF one of the operands is a (short) signed-extended constant, you get the ability to set and clear arbitrary fields - or individual bits (which, if you're doing anything with fields, is probably not terribly rare.), and a mask generate. 

The data paths for all these are pretty clean, lay out well, and add a couple of gate delays max over and above the existing shifter (which is a big deal if the shift is slower than the adder, of course; that would not be a huge surprise because you need to decode the shift amount and drive it across the data path).

Signed field extract is a bit nastier, as you need to calculate where the sign position is, select it, replicate is across the datapath, and OR it into the extracted field.

From an ISA perspective, its mostly about how many parameters you want to be able to supply. e.g.:
 2 reg sources+                     dest (variable shift)
 2 reg sources+1short imm + dest,  (fixed rotate, dbl shift)
 2 reg sources+2short imm + dest,  (fixed depositt)
 1 reg source  +1 short imm + dest (fixed shift)
 1 reg source  +2 short imm + dest (fixed field extract, mask generate)
etc.

If you can tolerate 3 register operands (which an implementation that has fused MulAdd might require), you can get variable versions of much of the above, though then the the extra time need to generate the control signals (from a register instead of an early arriving immediate) becomes an issue. 

The hardest operation to implement is totally variable field insertion, since it requires source register, receiving register, position register, and length register, along with the actual destination register (4 sources). I don't think so.... unless you do the combined position+length in a single register hack (at byte boundaries, I suggest) -  but it does look like a hack. 

PA-RISC had most of the above, but cheated; variable shift amount or positions where relegated to a dedicated control register, and variable length fields had no support at all (it was assumed that variable field lengths were rare). 
Other architectures, going back to at least the IBM-801, had similar operations.

Other musings:
The guidelines as to which operations should be included are well thought out. the weighting of them is problematic (not a criticism- it can't be avoided)

Small IOT devices may want a lot of bit fiddling and control (which extract/deposit are probably useful for) - but they may also want crypto support ( if you've paid attention to the Mirai botnet).  But does that kind of product need dedicated crypto HW? in that case crypto support in the ISA may not be as important as you would think. That may be true of other uses of bit manipulation (e.g. do you put video decode/encode in HW, or just plop down an decoder/enocder block in your ASIC....)

We may want to think of classes of product, not just specific applications.



 

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.

To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

Christopher Celio

unread,
Feb 7, 2017, 11:00:39 PM2/7/17
to Allen Baum, Bruce Hoult, Jacob Bachmeyer, Andrew Waterman, Michael Clark, RISC-V ISA Dev, 黃柏瑋
I greatly appreciate your email Allen.

One thing I wanted to voice my own opinion on (my apologies if this is absurdly obvious):

If you can tolerate 3 register operands (which an implementation that has fused MulAdd might require)

I don't want anybody to be misled into thinking that 3 operand integer instructions are free just because a processor already supports 3 operand FMAs. For starters, they would interface with different register files, so the third port would be a new cost to pay for many micro-architectures.


-Chris



To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.

To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

Stefan O'Rear

unread,
Feb 7, 2017, 11:05:20 PM2/7/17
to Allen Baum, Bruce Hoult, Jacob Bachmeyer, Andrew Waterman, Michael Clark, RISC-V ISA Dev, 黃柏瑋
On Tue, Feb 7, 2017 at 7:39 PM, Allen Baum <allen...@esperantotech.com> wrote:
> A bunch of implementation comments;

Need to distinguish between

(1) instructions that might be added to an implementation because they
have positive benefit/reward on that implementation;

(2) instructions a specific program might use;

(3) instructions which are likely to be used by a large enough variety
of programs to warrant the compatibility burden in ISA revisions 10
years hence.

I appreciate your bringing more of the implementation perspective in here.

> Small IOT devices may want a lot of bit fiddling and control (which
> extract/deposit are probably useful for) - but they may also want crypto
> support ( if you've paid attention to the Mirai botnet). But does that kind

Mirai _didn't_ attack small IOT devices though. It attacked Linux
devices with factory-configured root passwords — a reminder there are
some design mistakes no amount of crypto hardware will save you from.

-s

Jacob Bachmeyer

unread,
Feb 8, 2017, 12:37:48 AM2/8/17
to Allen Baum, Bruce Hoult, Andrew Waterman, Michael Clark, RISC-V ISA Dev, 黃柏瑋
Allen Baum wrote:
> From an ISA perspective, its mostly about how many parameters you want
> to be able to supply. e.g.:
> 2 reg sources+ dest (variable shift)
> 2 reg sources+1short imm + dest, (fixed rotate, dbl shift)
> 2 reg sources+2short imm + dest, (fixed depositt)
> 1 reg source +1 short imm + dest (fixed shift)
> 1 reg source +2 short imm + dest (fixed field extract, mask generate)
> etc.
>
> If you can tolerate 3 register operands (which an implementation that
> has fused MulAdd might require), you can get variable versions of much
> of the above, though then the the extra time need to generate the
> control signals (from a register instead of an early arriving
> immediate) becomes an issue.

RISC-V already has variable shift, so bringing control signals from a
register should not increase delay on any other shifter operation. Mask
generation would get its parameters at the same time the shifter gets
the offset for the field to extract. It is good to know that bitfield
access will not add excessive ALU complexity.

> The hardest operation to implement is totally variable field
> insertion, since it requires source register, receiving register,
> position register, and length register, along with the actual
> destination register (4 sources). I don't think so.... unless you do
> the combined position+length in a single register hack (at byte
> boundaries, I suggest) - but it does look like a hack.

The catch is that the RISC-V base ISA adheres to a
3-operand/2-input/1-output model and continuing to adhere to that model
is probably important. So we can only encode a destination register, a
source register, and either an immediate or a second source register.
This leaves packing position+length into a single control word. I
proposed that the control word should use an asymmetric split to
maximize the utility of ADDI for loading values. A full DEPOSIT-FIELD
operation is probably not doable, since it would require 3 inputs even
with the control word hack. A double-shift or funnel-shift is likewise
unlikely, also needing too many inputs.

While "shift-in-1s", "rotate", and AND-NOT/OR-NOT can be easily encoded
near the existing shifts and logical operations, POPC, BREV, BSWAP,
MASK-FIELD, EXTRACT-FIELD, GENERATE-MASK, etc. will need some other
solution. This is another reason why I proposed only R-type bitfield
operations: the available encoding space is likely to be limited to
using the low-order bits of funct7 within the OP major opcode, which is
R-type. The "M" standard extension already does this. This leaves us
with 2 reg sources + dest, so we can only have variable operations that
can be expressed within those constraints. I suspect that the "control
word" hack is unavoidable and will probably be used with different
control words for different instructions.
> send an email to isa-dev+u...@groups.riscv.org
> <mailto:isa-dev+u...@groups.riscv.org>.
> To post to this group, send email to isa...@groups.riscv.org
> <mailto:isa...@groups.riscv.org>.
> <https://groups.google.com/a/groups.riscv.org/group/isa-dev/>.
> <https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAMU%2BEkza-2mwoy7p2qxCFpmtho1-uaE8xVkUvw0295yPXrB%3DNw%40mail.gmail.com?utm_medium=email&utm_source=footer>.
>
>


Reply all
Reply to author
Forward
0 new messages