
Load/store with bit offset and mask


Thomas Koenig

Jan 20, 2023, 2:07:46 PM
A question.

These days, every reasonable highly-performing architecture allows
unaligned loads on a byte boundary. This includes being able to
handle crossing cache line boundaries, crossing page boundaries,
page faults and whatever else.

What about load and store instructions which allow, in addition,
a bit offset given by a constant or a register?

To be useful, stores at least would have to have an additional
parameter specifying the size, in bits, to be stored. For
symmetry, and to save masking off, loads should have that
capability, too.

For use cases, obviously compression comes to mind, plus image
processing (for example the formats having RGB in 3*10 bits).

Instruction format would be something like

LDBIT Rt,[Rb + R_bit_offset], R_mask
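
In C terms, the load side would compute roughly the following (just a
sketch; it assumes little-endian, LSB-first bit numbering, takes an
explicit width of 1..64 bits instead of R_mask, and may read a byte or
two past the end of the field, much as hardware reads a whole container):

#include <stdint.h>
#include <string.h>

/* Hypothetical LDBIT semantics: return the width-bit field that starts
   bitoff bits past base, right-justified and zero-extended. */
static uint64_t ldbit(const void *base, uint64_t bitoff, unsigned width)
{
    const unsigned char *p = (const unsigned char *)base + (bitoff >> 3);
    unsigned shift = (unsigned)(bitoff & 7);
    __uint128_t window = 0;
    memcpy(&window, p, 9);          /* 9 bytes always cover shift (<8) + 64 bits */
    uint64_t mask = (width == 64) ? ~(uint64_t)0
                                  : (((uint64_t)1 << width) - 1);
    return (uint64_t)(window >> shift) & mask;
}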

Comments? Too complicated? Makes the handling too slow/big
because of the three shifts? Has been tried recently and didn't
fly because... ?

MitchAlsup

Jan 20, 2023, 2:46:00 PM
On Friday, January 20, 2023 at 1:07:46 PM UTC-6, Thomas Koenig wrote:
> A question.
>
> These days, every reasonable highly-performing architecture allows
> unaligned loads on a byte boundary. This includes being able to
> handle crossing cache line boundaries, crossing page boundaries,
> page faults and whatever else.
<
Agreed, the aligned-only memory of the 1980s RISC machines was a mistake.
Agreed, misaligned access can cross line and page boundaries.
Agreed this makes memory references potentially have 2 page faults.
>
> What about load and store instructions which allow, in addition,
> a bit offset given by a constant or a register?
<
My 66000 has::
<
STD #3.141592326584,[Rbase+Rindex<<scale+Displacement]
>
> To be useful, stores at least would have to have an additional
> parameter specifying the size, in bits, to be stored. For
> symmetry, and to save masking off, loads should have that
> capability, too.
<
You lost me, here.
>
> For use cases, obviously compression comes to mind, plus image
> processing (for example the formats having RGB in 3*10 bits).
>
> Instruction format would be something like
>
> LDBIT Rt,[Rb + R_bit_offset], R_mask
>
> Comments? Too complicated? Makes the handling too slow/big
> because of the three shifts? Has been tried recently and didn't
> fly because... ?
<
Insert and extract save the day:: removing bit alignment from
memory references.
<
LDD Rc,[Rbase+Rindex<<scale+Displacement]
SLA Rb,Rc,<width:offset>
<
INS Rc,Rb,<width:offset>
STD Rc,[Rbase+Rindex<<scale+Displacement]
<
Bit-aligned memory references are 2 gates (plus wire delay) longer
than byte-aligned memory references.
<
Given double-width extract/insert and a check for container
stepping, the bit alignment generally gets done at low cost,
for this seldom-needed feature.
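
In plain C, those two sequences compute roughly the following (a sketch
only; it assumes the whole field lies within one doubleword, i.e.
offset + width <= 64, and shows the unsigned case):

#include <stdint.h>

/* LDD + extract: pull a width-bit field at bit offset out of the doubleword. */
static uint64_t extract_field(uint64_t dword, unsigned offset, unsigned width)
{
    uint64_t mask = (width == 64) ? ~(uint64_t)0 : (((uint64_t)1 << width) - 1);
    return (dword >> offset) & mask;
}

/* insert + STD: overwrite the width-bit field at bit offset in *p with val. */
static void insert_field(uint64_t *p, unsigned offset, unsigned width, uint64_t val)
{
    uint64_t mask = (width == 64) ? ~(uint64_t)0 : (((uint64_t)1 << width) - 1);
    *p = (*p & ~(mask << offset)) | ((val & mask) << offset);
}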

Thomas Koenig

Jan 20, 2023, 3:56:05 PM
MitchAlsup <Mitch...@aol.com> schrieb:
> On Friday, January 20, 2023 at 1:07:46 PM UTC-6, Thomas Koenig wrote:
>> A question.
>>
>> These days, every reasonable highly-performing architecture allows
>> unaligned loads on a byte boundary. This includes being able to
>> handle crossing cache line boundaries, crossing page boundaries,
>> page faults and whatever else.
><
> Agreed, the aligned-only memory of the 1980s RISC machines was a mistake.
> Agreed, misaligned access can cross line and page boundaries.
> Agreed this makes memory references potentially have 2 page faults.
>>
>> What about load and store instructions which allow, in addition,
>> a bit offset given by a constant or a register?
><
> My 66000 has::
><
> STD #3.141592326584,[Rbase+Rindex<<scale+Displacement]
>>
>> To be useful, stores at least would have to have an additional
>> parameter specifying the size, in bits, to be stored. For
>> symmetry, and to save masking off, loads should have that
>> capability, too.
><

> You lost me, here.

What I mean is the ability to load/store anywhere from 1 bit up to a
register's worth of bits at a base pointer + m bits (where m can be
large), without affecting the other bits in memory.

>>
>> For use cases, obviously compression comes to mind, plus image
>> processing (for example the formats having RGB in 3*10 bits).
>>
>> Instruction format would be something like
>>
>> LDBIT Rt,[Rb + R_bit_offset], R_mask
>>
>> Comments? Too complicated? Makes the handling too slow/big
>> because of the three shifts? Has been tried recently and didn't
>> fly because... ?
><
> Insert and extract save the day:: removing bit alignment from
> memory references.
><
> LDD Rc,[Rbase+Rindex<<scale+Displacement]
> SLA Rb,Rc,<width:offset>

This assumes that the bits to be loaded are contained
in a single 64-bit word, correct?

><
> INS Rc,Rb,<width:offset>
> STD Rc,[Rbase+Rindex<<scale+Displacement]

Same for store. The code for the case where this crosses a boundary
would be somewhat ugly: loading the first doubleword, checking
whether the bits come from one doubleword only, conditionally
loading a second one, and splicing the data together.
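
Roughly this, in C (a sketch of just that ugly path; LSB-first bit order
and width <= 64 assumed):

#include <stdint.h>

/* Load a field that may straddle a doubleword boundary: load the first
   doubleword, and only when offset + width spills over, load the second
   and splice the two halves together. */
static uint64_t load_straddling(const uint64_t *base, uint64_t bitoff, unsigned width)
{
    uint64_t word = bitoff >> 6;
    unsigned off  = (unsigned)(bitoff & 63);
    uint64_t mask = (width == 64) ? ~(uint64_t)0 : (((uint64_t)1 << width) - 1);
    uint64_t lo   = base[word] >> off;
    if (off + width > 64)                   /* crosses into the next doubleword */
        lo |= base[word + 1] << (64 - off);
    return lo & mask;
}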

><
> Bit aligned memory references are 2-gate (plus wire delay) longer
> than byte aligned memory references.
><
> Given double width extract/insert and a check for container
> stepping, generally gets the bit alignment done at low cost,
> for this seldom needed feature.

It's probably useful in compression and decompression, or if
somebody is really crazy about saving memory.

Anything else?

MitchAlsup

Jan 20, 2023, 4:00:33 PM
LDD Rc,[Rbase+Rindex<<scale+Displacement]
LDD Rd,[Rbase+Rindex<<scale+Displacement+8]
CARRY Rd,{i}
SLA Rb,Rc,<width:offset>
<
handles any width from 1..64 bits at any bit position 127..0
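
Which in C is essentially (sketch; __uint128_t stands in for the
CARRY-extended 128-bit source):

#include <stdint.h>

/* Double-width extract: two adjacent doublewords form one 128-bit container,
   so any 1..64-bit field starting at bit 0..63 of the lower word comes out
   without a separate boundary check. */
static uint64_t extract128(uint64_t lo, uint64_t hi, unsigned offset, unsigned width)
{
    __uint128_t pair = ((__uint128_t)hi << 64) | lo;
    uint64_t mask = (width == 64) ? ~(uint64_t)0 : (((uint64_t)1 << width) - 1);
    return (uint64_t)(pair >> offset) & mask;
}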

Stephen Fuld

Jan 20, 2023, 4:02:40 PM
When we discussed this some time ago, in the context of (IIRC) picking
apart compressed strings, I thought you were going to implement a "load
bit field" instruction in order to reduce the number of instructions in
the inner loop by (IIRC) 3-4. Did you change your mind on this?



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

John Levine

Jan 20, 2023, 4:30:35 PM
It appears that Thomas Koenig <tko...@netcologne.de> said:
>A question.
>
>These days, every reasonable highly-performing architecture allows
>unaligned loads on a byte boundary. This includes being able to
>handle crossing cache line boundaries, crossing page boundaries,
>page faults and whatever else.
>
>What about load and store instructions which allow, in addition,
>a bit offset given by a constant or a register?
>
>To be useful, stores at least would have to have an additional
>parameter specifying the size, in bits, to be stored. For
>symmetry, and to save masking off, loads should have that
>capability, too.

The VAX had extract, insert, and compare bit field instructions. They
took an address, a size, and an offset. The only alignment restriction
was that if the address was a register, the field had to be entirely
within that register.

I think the IBM STRETCH had bit aligned loads and stores too but I don't
have the book about it handy.
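
From memory, the zero-extending variant (EXTZV) behaved roughly like this
in C terms (a sketch only; position in bits, size in bits from 0 to 32, a
byte base address; the real instruction also allowed negative positions):

#include <stdint.h>
#include <string.h>

/* Rough EXTZV semantics: zero-extended extract of a size-bit field
   starting pos bits past base (non-negative pos assumed here). */
static uint32_t extzv(uint64_t pos, unsigned size, const void *base)
{
    const unsigned char *p = (const unsigned char *)base + (pos >> 3);
    unsigned shift = (unsigned)(pos & 7);
    uint64_t window = 0;
    memcpy(&window, p, 5);          /* 5 bytes cover shift (<8) + 32 bits */
    uint32_t mask = (size == 32) ? ~(uint32_t)0 : ((1u << size) - 1u);
    return (uint32_t)(window >> shift) & mask;
}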

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

MitchAlsup

Jan 20, 2023, 4:37:46 PM
On Friday, January 20, 2023 at 3:30:35 PM UTC-6, John Levine wrote:
> It appears that Thomas Koenig <tko...@netcologne.de> said:
> >A question.
> >
> >These days, every reasonable highly-performing architecture allows
> >unaligned loads on a byte boundary. This includes being able to
> >handle crossing cache line boundaries, crossing page boundaries,
> >page faults and whatever else.
> >
> >What about load and store instructions which allow, in addition,
> >a bit offset given by a constant or a register?
> >
> >To be useful, stores at least would have to have an additional
> >parameter specifying the size, in bits, to be stored. For
> >symmetry, and to save masking off, loads should have that
> >capability, too.
> The Vax had extract, insert, and compare bit field instructions. They
> took an address, a size, and an offset. The only alignment restriction
> was that if the address was a register, the field had to be entirely
> within that register.
>
> I think the IBM STRETCH had bit aligned loads and stores too but I don't
> have the book about it handy.
<
The 68020 had memory-based bit field instructions, too.

MitchAlsup

Jan 20, 2023, 4:39:07 PM
I must be getting old; I don't remember that conversation in enough
detail to say. But I don't think I ever contemplated bit-aligned memory
references.

JimBrakefield

Jan 20, 2023, 7:31:16 PM
On Friday, January 20, 2023 at 3:30:35 PM UTC-6, John Levine wrote:
The IBM Stretch had 24-bit bit-addressable memory pointers (2 MB max).
Bytes could be from one to eight bits wide, using a 64-bit instruction
format versus the 32-bit one. It also had branch-on-bit instructions.

For modifying I/O port configurations, ARM bit-banding makes sense?
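
Cortex-M bit-banding maps each bit of the SRAM/peripheral bit-band regions
onto its own word in an alias region, so one word store sets or clears a
single bit; the alias address works out roughly as below (sketch; the base
addresses are the usual Cortex-M3/M4 values, and GPIO_ODR_ADDR is just a
placeholder):

#include <stdint.h>

#define PERIPH_BB_BASE   0x40000000UL   /* start of the peripheral bit-band region */
#define PERIPH_BB_ALIAS  0x42000000UL   /* start of its bit-band alias region      */

/* Word in the alias region that maps bit 'bit' of the byte at 'addr'. */
static volatile uint32_t *bitband_alias(uintptr_t addr, unsigned bit)
{
    uintptr_t byte_offset = addr - PERIPH_BB_BASE;
    return (volatile uint32_t *)(PERIPH_BB_ALIAS + byte_offset * 32u + bit * 4u);
}

/* e.g.  *bitband_alias(GPIO_ODR_ADDR, 5) = 1;   sets bit 5 with a single store */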

Stephen Fuld

Jan 21, 2023, 2:30:52 AM
I found it. Below I have pasted, as a quotation, a post that includes
most of the context, the proposed solution, and your response.

> From: "MitchAlsup" <Mitch...@aol.com>
> Subject: Re: RISC-V vs. Aarch64
> Date: Fri, 21 Jan 2022 09:53:16 -0800 (PST)
> Message-ID: <9d652029-a997-4bec...@googlegroups.com>
> Lines: 143
>
> On Friday, January 21, 2022 at 11:05:28 AM UTC-6, Stephen Fuld wrote:
>> On 1/13/2022 9:53 AM, MitchAlsup wrote:
>
>> > I came up with this in 5 minutes::
>> > This assumes the input bit-length selector is an vector of characters and that the
>> > chars contain values from {1..64}
>> > <
>> > void unpack( uchar_t size[], uint64_t packed[], uint64_t unpacked[], uint64_t count )
>> > {
>> > uint64_t len,
>> > bit=0,
>> > word=0,
>> > extract,
>> > container1 = packed[0],
>> > container2 = packed[1];
>> >
>> > for( unsigned int i = 0; i < count; i++ )
>> > {
>> > len = size[i];
>> > bit += len;
>> > extract = ( len << 32 ) | ( bit & 0x3F );
>> > if( word != bit >> 6 )
>> > {
>> > container1 = container2;
>> > container2 = packed[++word];
>> > }
>> > unpacked[i] = {container2, container1} >> extract;
>> > }
>> > }
>> > <
>> > This translates into pretty nice My 66000 ISA:
>> > <
>> > ENTRY unpack
>> > unpack:
>> > MOV R5,#0
>> > MOV R6,#0
>> > LDD R7,[R2]
>> > LDD R8,[R2+8]
>> > MOV R9,#0
>> > loop:
>> > LDUB R10,[R1+R9]
>> > ADD R5,R5,R10
>> > AND R11,R5,#63
>> > SL R12,R10,#32
>> > OR R11,R11,R12
>> > SR R12,R6,#6
>> > CMP R11,R6,R12
>> > PEQ R11,{111}
>> > ADD R6,R6,#1
>> > MOV R7,R8
>> > LDD R8,[R2+R6<<3]
>> > CARRY R8,{{I}}
>> > SL R12,R7,R11
>> > STD R12,[R3+R9<<3]
>> > ADD R9,R9,#1
>> > CMP R11,R9,R4
>> > BLT R11,loop
>> > RET
>> > <
>> > Well at least straightforwardly.
>>
>>
>> If Terje is right, and he almost always is, it is worth trying to come
>> up with a better solution for this type of problem. So, as a start, I
>> came up with what follows. This certainly isn’t the final solution. It
>> is intended to start a discussion on better ways to do this. And the
>> usual disclaimer, IANAHG, so this is from a software perspective. But I
>> did try to fit it “in the spirit” of the MY 66000, and it takes
>> advantages of that design’s unique capabilities.
>>
>> The idea is to add one new instruction, which typically would be in the
>> shadow of a preceding Carry meta instruction. I called the new
>> instruction Load Bit Field (LBF).
>>
>> It is a two source, one result instruction, but uses the carry register
>> for an additional source and destination. The syntax is
>>
>> LBF Result register, field length (in bits), buffer starting address
>> (in bytes)
>>
>> The carry register contains the offset, in bits, from the start of the
>> buffer where the desired field starts.
>>
>> The instruction computes the start of the desired field by adding all
>> but the low-order three bits of the carry register to the buffer address
>> to get the starting byte number, then uses the low-order three bits to
>> get the starting bit number. The instruction extracts the field, starting
>> at the computed bit address, with the length given in the specified
>> register, and right-justifies that field in the result register. The
>> higher-order bits in the result register are set to zero. If the output
>> bit of the Carry instruction is set, the length value is added to the
>> Carry register.
> <
> A bit more on the CISC side than desired (most of the time)--3
> exceptions possible, 2 memory accesses. Also note, my original
> solution can produce signed or unsigned output stream. This is
> going to take 2 cycles in AGEN, and 2 result register writes.
>>
>> In order to speed up this instruction, and given that it will frequently
>> occur in a fairly tight loop, I think (hope) that the hardware can take
>> advantage of the “streaming” buffers otherwise used for VVM operations.
>> Anyway, if one had this instruction, the main loop in the code above
>> could be something like
>>
>>
>> loop:
>> LDUB R10,[R1+R9]
>> CARRY R6,IO
>> LBF R12,R10,R2 ;I am not sure about R2, It should be the start of
>> the packed buffer.
>> STD R12,[R3+R9<<3]
>> ADD R9,R9,#1
>> CMP R11,R9,R4
>> BLT R11,loop
>>
>> For a savings of about 10 instructions in the I cache, but fewer in
>> execution (but still significant) depending upon how often the
>> instructions under the predicate are executed.
>>
> I have to admit, this looks fairly juicy--just have to plow my way
> through and see what comes out.
>>
>> Anyway, Of course, I invite comments, criticisms, etc. One obvious
>> drawback is that this only addresses the "decompression" side. While I
>> briefly considered a "Store Bit Field", I discarded it as it seemed too
>> complex, and presumably would used less frequently, as
>> compression/coding happens less frequently than decompression/decoding.

end of copied old post.
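
For reference, the proposed LBF amounts to roughly this in C (just a
sketch of the semantics; bitpos stands in for the carry register, the
final increment is what the CARRY output bit would provide, and it may
read one doubleword past the last field, as hardware would):

#include <stdint.h>

/* Sketch of Load Bit Field: extract len bits starting *bitpos bits past
   buf, right-justified and zero-extended, then advance *bitpos by len. */
static uint64_t lbf(const uint64_t *buf, uint64_t *bitpos, unsigned len)
{
    uint64_t word = *bitpos >> 6;
    unsigned off  = (unsigned)(*bitpos & 63);
    __uint128_t pair = ((__uint128_t)buf[word + 1] << 64) | buf[word];
    uint64_t mask = (len == 64) ? ~(uint64_t)0 : (((uint64_t)1 << len) - 1);
    *bitpos += len;
    return (uint64_t)(pair >> off) & mask;
}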

MitchAlsup

Jan 21, 2023, 11:49:48 AM
Exactly 1 year ago today.
<
In any event, this has not yet made it into my ISA.

MitchAlsup

Jan 21, 2023, 4:47:54 PM
On Saturday, January 21, 2023 at 1:30:52 AM UTC-6, Stephen Fuld wrote:
> On 1/20/2023 1:39 PM, MitchAlsup wrote:
> >> On 1/20/2023 11:45 AM, MitchAlsup wrote:

> >> > void unpack( uchar_t size[], uint64_t packed[], uint64_t unpacked[], uint64_t count )
> >> > {
> >> > uint64_t len,
> >> > bit=0,
> >> > word=0,
> >> > extract,
> >> > container1 = packed[0],
> >> > container2 = packed[1];
> >> >
> >> > for( unsigned int i = 0; i < count; i++ )
> >> > {
> >> > len = size[i];
> >> > bit += len;
> >> > extract = ( len << 32 ) | ( bit & 0x3F );
> >> > if( word != bit >> 6 )
> >> > {
> >> > container1 = container2;
> >> > container2 = packed[++word];
> >> > }
> >> > unpacked[i] = {container2, container1} >> extract;
> >> > }
> >> > }
<
I spent an hour and came up with a better subroutine::
<
void unpack( uchar_t size[], uint64_t packed[],
             uint64_t unpacked[], uint64_t count )
{
    uint64_t len,
             bit = 0,
             word = 1,
             container1,
             container2 = packed[0];

    for( unsigned int i = 0; i < count; i++ )
    {
        container1 = container2;
        container2 = packed[word++];
        do {
            len = size[i];
            unpacked[i] = ({container2, container1} >> bit)
                          & ~(~0 << len);
            bit += len;
        } while( bit < 64 );
        bit &= 63;
    }
}
<
Which has the ability to be compiled into::
<
unpack:
    MOV   R6,#0
    MOV   R7,#1
    LDD   R9,[R2]
begin_for_loop:
    MOV   R10,#0
for_loop:
    MOV   R8,R9
    LDD   R9,[R4+R7<<3]
begin_do_loop:
    VEC   R12,{}
do_loop:
    LDUB  R5,[R1+R10]
    CARRY R9,{I}
    SLL   R11,R8,R6
    SRL   R12,#-1,R5
    AND   R11,R11,~R12
    LOOP  LE,R6,R5,#63
end_do_loop:
    AND   R6,R6,#63
    ADD   R10,R10,#1
    CMP   R11,R10,R4
    BLT   R11,for_loop
end_for_loop:
    RET
<
The inner loop executes 5 instructions {LDUB, SRL, SLL, AND, LOOP}
and contains the CARRY instruction-modifier.
<
Apparently creating a mask and using it is just as efficient as building an
extraction variable (so I changed the code to reflect that).
<
Also note:: the previous code was off by the length of the first bit field.
<
Latency analysis::
<
The first 3 instructions can issue together (depending on width of machine).
Instructions {4 and 6} can begin execution as soon as the LDUB data arrives.
So the loop latency is LD-latency+2 = 5 cycles.
<
I doubt any small addition to the memory reference instructions would
make this look "lots faster", so it is contraindicated.

Thomas Koenig

Jan 22, 2023, 9:07:08 AM
MitchAlsup <Mitch...@aol.com> schrieb:

> I spent an hour and came up with a better subroutine::
><
> void unpack( uchar_t size[], uint64_t packed[],
> uint64_t unpacked[], uint64_t count )
> {
> uint64_t len,
> bit =0,
> word =1,
> container1,
> container2 = packed[0];
>
> for( unsigned int i = 0; i < count; i++ )
> {
> container1 = container2;
> container2 = packed[word++]
> do {
> len = size[i];
> unpacked[i] = ({container2, container1} >> bit)
> & ~(~0 <<len);
> bit += len;
> } while( bit < 64 );
> bit &= 63;
> }
> }

Cleaning this up a little bit, I get

#include <stdint.h>

typedef unsigned char uchar_t;

void
unpack (uchar_t size[], uint64_t packed[], uint64_t unpacked[], uint64_t count)
{
uint64_t len, bit = 0, word = 1, container1, container2 = packed[0];

for (unsigned int i = 0; i < count; i++)
{
container1 = container2;
container2 = packed[word++];
__uint128_t cont;
cont = ((__uint128_t)container2 << 64) | container1;
do
{
len = size[i];
unpacked[i] = (cont >> bit) & ~(~0 << len);
bit += len;
}
while (bit < 64);
bit &= 63;
}
}

> Which has the ability to be compiled into::

Nicely put :-)

Right now, it hits https://github.com/bagel99/llvm-my66000/issues/24
(which I just opened today, before reading your post).

Stephen Fuld

Jan 22, 2023, 3:10:03 PM
Nice. It looks like you have reduced the advantage of the load bit
field instruction. Some comments:

1. When I made the suggestion, I was just looking to make minimal
changes to the generated code. I didn't look at the possibility of
using the VEC/LOOP instructions. If these could be used for the loop in
my code, I think it saves another instruction in the loop.

2. Your solution still has the overhead of extra code for when you
cross a 64-bit boundary. You have changed it from a big PRED to an
outer loop, but it is still there. So you have to factor in the extra
instructions in the outer loop times 64 divided by the average bit field
length. Note that this also means you have to load and parse the inner
loop once per 64 bits, whereas my proposed solution handles 64-bit
crossings without any extra instructions, nor does it require the
reload/reparse of the inner loop (assuming you can use VVM for my inner
loop).

3. Your solution requires three instructions per bit field, two shifts
and an AND, whereas these are combined into a single instruction in my
proposed solution.

Of course, whether using the proposed instruction is "lots faster"
depends on your definition of "lots". :-)

MitchAlsup

Jan 22, 2023, 3:39:24 PM
I have coded this with and without VEC/LOOP
>
> 2. Your solution still have the overhead of extra code for when you
> cross a 64 bit boundary. You have changed it from a big PRED, to an
> outer loop. But it is still there. So you have to factor in the extra
> instructions in the outer loop times 64 divided by the average bit field
> length. Note that that also means you have to load and parse the inner
> loop once per 64 bits, whereas, my proposed solution handles 64 bit
> crossings without any extra instructions, nor does it require the
> reload/reparse of the inner loop (assuming you can use VVM for my inner
> loop)
<
I spent ½ hour trying to put a loop around extraction from 64 bits;
the loop overhead makes this impractical. I have to keep track of
bit, i, and word all independently. This makes the smallest inner loop
larger than the single loop with the embedded if statement.
<
void unpackS( uchar_t size[], uint64_t packed[],
              uint32_t unpacked[], uint64_t count )
{
    uint32_t len,
             bit = 0,
             word = 1,
             container = packed[0];

    for( unsigned int i = 0; i < count; bit &= 31 )
    {
        container |= packed[word++] << 32;
        do {
            len = size[i];
            unpacked[i] = container & ~(~0 << len);
            container >>= len;
            bit += len;
            i++;
        } while( bit < 32 && i < count );
    }
}
<
I also tried 32-bit containers (above) and 64-bit containers (previous);
the best code I can see is as shown previously.
>
> 3. Your solution requires three instructions per bit field, two shifts
> and an AND, whereas these are combined into a single instruction in my
> proposed solution.
<
I used the ~0<<len shift instead of loading a mask.
My alternative of inserting the len into the shift (making it an extract)
takes more instructions than generating my own mask !?!
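
I.e. the two alternatives in C, for the record (sketch; both assume
1 <= len < 64, and the extract-style version just packs the same
<width:offset> operand the earlier code built):

#include <stdint.h>

/* mask it yourself: two shifts and an AND on the container */
static uint64_t field_by_mask(uint64_t container, unsigned bit, unsigned len)
{
    return (container >> bit) & ~(~(uint64_t)0 << len);
}

/* extract style: build the (len << 32) | offset operand first, then extract */
static uint64_t field_by_extract(uint64_t container, uint64_t spec)
{
    unsigned off = (unsigned)(spec & 63), len = (unsigned)(spec >> 32);
    return (container >> off) & ~(~(uint64_t)0 << len);
}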