I found it. Below I have pasted, as a quotation, a post that includes
most of the context, the proposed solution, and your response.
> From: "MitchAlsup" <Mitch...@aol.com>
> Subject: Re: RISC-V vs. Aarch64
> Date: Fri, 21 Jan 2022 09:53:16 -0800 (PST)
> Message-ID: <9d652029-a997-4bec...@googlegroups.com>
> Lines: 143
>
> On Friday, January 21, 2022 at 11:05:28 AM UTC-6, Stephen Fuld wrote:
>> On 1/13/2022 9:53 AM, MitchAlsup wrote:
>
>> > I came up with this in 5 minutes::
>> > This assumes the input bit-length selector is a vector of characters and that the
>> > chars contain values from {1..64}
>> > <
>> > void unpack( uchar_t size[], uint64_t packed[], uint64_t unpacked[], uint64_t count )
>> > {
>> >     uint64_t len,
>> >              bit=0,
>> >              word=0,
>> >              extract,
>> >              container1 = packed[0],
>> >              container2 = packed[1];
>> >
>> >     for( unsigned int i = 0; i < count; i++ )
>> >     {
>> >         len = size[i];
>> >         bit += len;
>> >         extract = ( len << 32 ) | ( bit & 0x3F );
>> >         if( word != bit >> 6 )
>> >         {
>> >             container1 = container2;
>> >             container2 = packed[++word];
>> >         }
>> >         unpacked[i] = {container2, container1} >> extract;
>> >     }
>> > }
>> > <
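For anyone who wants to try the loop on an ordinary compiler: the
{container2, container1} >> extract line is not standard C. Here is a
minimal portable sketch of what I take the intent to be. The name
unpack_portable is mine, and I am assuming that fields fill each 64-bit
word from bit 0 upward and that every length is in 1..64. It reads the bit
offset before advancing it, and only touches packed[word + 1] when a field
actually straddles a word boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portable version of the unpack loop (not the original code).
 * Assumes fields are packed LSB-first and each size[i] is in 1..64. */
void unpack_portable(const uint8_t size[], const uint64_t packed[],
                     uint64_t unpacked[], size_t count)
{
    uint64_t bit = 0;                          /* running bit offset */

    for (size_t i = 0; i < count; i++) {
        unsigned len  = size[i];
        size_t   word = (size_t)(bit >> 6);    /* 64-bit word holding the start */
        unsigned off  = (unsigned)(bit & 63);  /* bit position inside that word */

        uint64_t field = packed[word] >> off;
        if (off + len > 64)                    /* field spills into the next word */
            field |= packed[word + 1] << (64 - off);

        /* Shifting a 64-bit value by 64 is undefined in C, so len == 64
         * gets its own mask. */
        uint64_t mask = (len == 64) ? ~(uint64_t)0
                                    : (((uint64_t)1 << len) - 1);
        unpacked[i] = field & mask;
        bit += len;
    }
}
```

This produces the unsigned output stream only. A signed variant would
sign-extend from bit len-1 instead of masking.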
>> > This translates into pretty nice My 66000 ISA:
>> > <
>> > ENTRY unpack
>> > unpack:
>> >     MOV R5,#0
>> >     MOV R6,#0
>> >     LDD R7,[R2]
>> >     LDD R8,[R2+8]
>> >     MOV R9,#0
>> > loop:
>> >     LDUB R10,[R1+R9]
>> >     ADD R5,R5,R10
>> >     AND R11,R5,#63
>> >     SL R12,R10,#32
>> >     OR R11,R11,R12
>> >     SR R12,R6,#6
>> >     CMP R11,R6,R12
>> >     PEQ R11,{111}
>> >     ADD R6,R6,#1
>> >     MOV R7,R8
>> >     LDD R8,[R2+R6<<3]
>> >     CARRY R8,{{I}}
>> >     SL R12,R7,R11
>> >     STD R12,[R3+R9<<3]
>> >     ADD R9,R9,#1
>> >     CMP R11,R9,R4
>> >     BLT R11,loop
>> >     RET
>> > <
>> > Well at least straightforwardly.
>>
>>
>> If Terje is right, and he almost always is, it is worth trying to come
>> up with a better solution for this type of problem. So, as a start, I
>> came up with what follows. This certainly isn’t the final solution. It
>> is intended to start a discussion on better ways to do this. And the
>> usual disclaimer, IANAHG, so this is from a software perspective. But I
>> did try to fit it “in the spirit” of the My 66000, and it takes
>> advantage of that design’s unique capabilities.
>>
>> The idea is to add one new instruction, which typically would be in the
>> shadow of a preceding Carry meta instruction. I called the new
>> instruction Load Bit Field (LBF).
>>
>> It is a two source, one result instruction, but uses the carry register
>> for an additional source and destination. The syntax is
>>
>> LBF Result register, field length (in bits), buffer starting address
>> (in bytes)
>>
>> The carry register contains the offset, in bits, from the start of the
>> buffer where the desired field starts.
>>
>> The instruction computes the start of the desired field by adding all
>> but the low order three bits of the carry register to the buffer
>> starting address to get the starting byte, then uses the low order
>> three bits to get the starting bit number within that byte. The
>> instruction extracts the field, starting at the computed bit address,
>> with the length given in the register specified in the instruction,
>> and right justifies that field in the result register. The higher
>> order bits in the result register are set to zero. If the output bit
>> of the Carry instruction is set, the length value is added to the
>> Carry register.
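To pin down my reading of the description above, here is a small C model
of what LBF would compute. This is only a sketch: lbf_model and its
update_carry flag are names I made up, the bit order (least significant
bit of each byte first) is an assumption since the proposal does not
state one, and a length in 1..64 is assumed. It says nothing about
exceptions or timing.

```c
#include <stdint.h>

/* Hypothetical software model of the proposed LBF instruction.
 *   buf          - start of the packed buffer (byte address)
 *   len          - field length in bits, assumed 1..64
 *   carry        - bit offset of the field from the start of buf
 *   update_carry - nonzero models the Carry "output" bit being set
 * Returns the field right-justified and zero-extended. */
uint64_t lbf_model(const uint8_t *buf, unsigned len,
                   uint64_t *carry, int update_carry)
{
    uint64_t byte   = *carry >> 3;             /* all but the low three bits */
    unsigned bit    = (unsigned)(*carry & 7);  /* low three bits */
    uint64_t result = 0;
    unsigned got    = 0;                       /* bits gathered so far */

    while (got < len) {
        unsigned avail = 8 - bit;              /* bits left in this byte */
        unsigned take  = (len - got < avail) ? len - got : avail;
        uint64_t chunk = ((uint64_t)buf[byte] >> bit) & ((1u << take) - 1u);

        result |= chunk << got;
        got += take;
        bit = 0;                               /* later bytes start at bit 0 */
        byte++;
    }

    if (update_carry)
        *carry += len;                         /* advance to the next field */
    return result;
}
```

Going byte by byte keeps the model free of alignment and host-endianness
concerns. A field of up to 64 bits that starts mid-byte can touch nine
bytes, so it can straddle two aligned 64-bit words.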
> <
> A bit more on the CISC side than desired (most of the time)--3
> exceptions possible, 2 memory accesses. Also note, my original
> solution can produce signed or unsigned output stream. This is
> going to take 2 cycles in AGEN, and 2 result register writes.
>>
>> In order to speed up this instruction, and given that it will frequently
>> occur in a fairly tight loop, I think (hope) that the hardware can take
>> advantage of the “streaming” buffers otherwise used for VVM operations.
>> Anyway, if one had this instruction, the main loop in the code above
>> could be something like
>>
>>
>> loop:
>>     LDUB R10,[R1+R9]
>>     CARRY R6,IO
>>     LBF R12,R10,R2 ; I am not sure about R2. It should be the start of the packed buffer.
>>     STD R12,[R3+R9<<3]
>>     ADD R9,R9,#1
>>     CMP R11,R9,R4
>>     BLT R11,loop
>>
>> For a savings of about 10 instructions in the I cache, but fewer in
>> execution (but still significant) depending upon how often the
>> instructions under the predicate are executed.
>>
> I have to admit, this looks fairly juicy--just have to plow my way
> through and see what comes out.
>>
>> Anyway, of course, I invite comments, criticisms, etc. One obvious
>> drawback is that this only addresses the "decompression" side. While I
>> briefly considered a "Store Bit Field", I discarded it as it seemed too
>> complex, and presumably would be used less frequently, as
>> compression/coding happens less frequently than decompression/decoding.
End of copied old post.