It's common knowledge that a computer stores data only in bits: the 1's
and 0's that comprise the binary alphabet 0,1.
While that is so, it's not always easy or efficient to deal with bits
when programming a computer, because the computer's method to address
data works with blocks of bits, 8, 16, 32, or 64 bits at a time. Thus
it is not always simple or efficient to interpret or express long
sequences of bits that aren't used a fixed, hardwired amount of bits at
For example, the data of interest might be bits 4-19, numbered from 0,
in a sequence of four bytes, 32 bits. You can only address specifically
any one of those bytes, and then it is necessary to pick apart the byte
and put together bits from different bytes, and it is cumbersome to
handle that. As well, if you want to emit, say, three bits then six
bits then two bits into a couple bytes, again it involves setting the
contents of one byte, then modifying it and the next byte also for the
next bit sequence, and then modify that next byte again, and then
keeping track of how to modify that next one.
There are various requirements to access the bits. Let's focus on
reading bits first. Basically then there is some input data, of some
arbitrary length. Let's say the length of the input data, in bytes, is
known, although because the bitstring is not necessarily a multiple of
eight bits, the last zero to seven bits of the byte sequence are
meaningless. For simplicity, say there is a multiple of eight bits
stored in a byte sequence.
That is two bytes, and the bit representation is thus:
0001 0010 0011 0100
If that is data stored two or four bits at a time, it is not so bad:
just read two bits four times for each byte for two bits. Yet, when the
data is stored three bits at a time, then you can read the first three
bits from the first byte, and the next three bits from the first byte,
but then you have to read two bits from the first byte and then the
first bit from the second byte, and combine them, and then in reading
the second byte it is necessary to read the second, third, and fourth
bit, and then the fifth, sixth, and seventh, and then there is an extra
bit there that is unused.
It gets even more complicated when the amount of bits to read from two
bytes varies. For example, you might need to read five bits, and then
six bits, and then five bits, which at least starts over each two bytes
(16 bits), but it might be necessary to read three bits, and then from
those three bits detemine how many bits the next item contains, and so
Dealing with bitstreams rapidly leads to many variables, edge cases,
inefficiencies, and logical complications. Thus, it's a good idea to
determine a classification of bit-wise access strategies to help enable
efficient bitstream use.
In reading bits, we might consider these cases: a fixed number of bits
for each record, and a varying or even dynamic number of bits for each
record. Let's start with a fixed number of bits for each record.
Bits in a byte are generally numbered from 7 to 0. This is because the
least significant bit is called bit 0, and the most significant bit (of
eight bits) is called bit 7, and, because bits in a byte are generally
ordered from most-significant-bit (msb) to least-significant-bit
(lsb). The bits toward the left side are called the high bits, towards
the right side the low bits.
Now in reading a fixed width of bits from an array of bytes that can
only be addressed each eight bits, then there are two cases: if the
width of the bits is a power of two less than or equal to the word
width, then the records can be accessed relatively simply. For example,
where the word width is eight for a byte, then bit records of fixed size
2, 4, or 8 can be accessed in a simple manner, because each word can be
considered without consideration of the previous or next word, or the
index of the word in the list of words.
So an access pattern might be, for each word, for a word that is a byte
and fixed two bits:
1. address the word
2. get first two bits
3. get second two bits
4. get third two bits
5. get fourth two bits
6. go to step 1 while there are any more words in the input
That is not too bad, for example in C it might be:
byte b = bytearray[i];
processbits(b & 0xC0 >> 6);
processbits(b & 0x30 >> 4);
processbits(b & 0x0C >> 2);
processbits(b & 0x03);
if(i<bytearraylength) goto processbyte;
Then, where there are fixed length records of three bits instead of two,
then instead of that loop that has about five lines, it takes upwards of
forty or fifty lines to as efficiently as possible from C process in
order those fixed length records.
Data is often stored in bit records that are word width:
if(i<bytearraylength) goto processbyte;
So anyways in considering bit access strategies, there are lots of cases
to consider, and what ends up happening is that for each different
access pattern, and indeed many times for each particular
implementation, a unique and custom-built and often quite inefficient
access strategy is laboriously implemented by hand.
Partially that's because there is no standardized method or set of
patterns to accomplish the needed work of various bitstream access
There are a lot of ways that bitstreams are used to encode data. For
example, if an image has a given number of bits per sample, then often
the data is packed thus that the data takes less room than if it were
simply using a few bits of each word. Then, to modify that data for
most purposes, it needs to be either unpacked or specially modified as
There seem to be two major paradigms of the bit handling: reading and
writing. In reading, the input is a bitstream and the output is a list
of symbols that represent the data encoded into those bits. In writing
it is nearly the opposite: the output is a bitstream and the input is a
list of symbols.
One of the primary cursable aspects of dealing with the bitstream data
is that the bit groups are not necessarily evenly dividing into the word
width, thus that reading or writing those bit to a contiguous bit stream
demands in those cases that multiple words be considered at a time.
Instead of considering only a single register of the computer at a time,
two or even more registers need to be handled at once. It's bad enough
when a bit sequence may be within two registers, it's worse when the bit
sequence width is more than that of two or more registers.
When the bit sequence is within one register, then the operations to
isolate it and output it basically unpacked to an array of items that
each can contain the largest bit sequence is not so bad. For example,
where the data is msb to lsb in bit order, and MSB to LSB in byte order,
then on a big-endian machine (MSB to LSB) with 32-bit registers, 32
contiguous bits of the bitstream can be loaded onto a register at a
time. Then, the input register is copied into a working register.
Then, the register can be shifted right, towards the lsb, thus that the
lsb of the bit sequence is the lsb of the register. Then, where the
width of the bit sequence is known, the high order bits of the register
are masked off or cleared, and then the register contains the result bit
sequence that is then a machine integer representing the needed data.
That register is then copied into a memory location, the memory pointer
is updated to refer to the next location, and then the next bit sequence
is parsed/scanned from the original register.
That depends on that: the bit sequence width is known ahead of time,
and that the offset of the beginning of the bit sequence in the register
and the width of the bit sequence is less than the width of the
register. If the bit sequence straddles registers, then multiple
registers need to be considered, and it gets messy.
So, I've looked around, what do you think about using assembly language
to manipulate bitstreams? What algorithms, minimal or exotic, are
reusable for various schemes? How do you use assembler to save the the
processor billions of cycles over any "high" level language for bit
Thanks for your opinions,
"Also, consider this: the unit impulse function times
one less twice the unit step function times plus/minus
one is the mother of all wavelets."
| Programming with Bits
[... see OP]
| So, I've looked around, what do you think about using assembly language
| to manipulate bitstreams? What algorithms, minimal or exotic, are
| reusable for various schemes? How do you use assembler to save the the
| processor billions of cycles over any "high" level language for bit
Even not the fastest solution, I'd use the bit-instructions:
BSF/BSR/BT/BTC/BTS/BTR in comination with SHL/SHR/ROR/ROL/OR/AND/XOR.
In the register form: "BTx [mem],eax" the BT-group allow
bit-manipulation for up to +-2 Gigabit offsets (+-256 MB).
wolfgang kern wrote:
Primarily, I didn't know BT, Bit Test, had any meaning for values of the
source operand greater than 32.
One thing I don't understand is that when the destination operand is a
memory location, and the source operand is the bit index, from lsb to msb,
how that would access the bits in a happy order. For example of the two
words (register widths) at the memory offset X are
ie, big-endian data, then I want to access x, then x+1, then x+2, then x+3,
etcetera. Many and arguably most forms of interchange data are stored in
big-endian byte order. Here, only the bits are at issue, but if I call
bt [x], 0
then it sets the carry flag for bit 0 of 0x78, [x+3], instead of bit 7 of
[x]. This is where the basic notion of most bit scanners is to scan the
bits left to right in ascending memory address order.
To scan in the "natural" order, or left to right, I would have to decrement
in multiples of 32, ie test bit 31, 30, 29, .., 0, 63, 62, 61, ..., 32, 95,
94, .... testing 32 bits in order and then adding 63 to the bit offset. If
the source operand is a 16 bit register instead of a 32 bit register, then
the cycle is 16 instead of 32 bits.
This is compared to using TEST, with variously rotating the source operand
one bit per iteration or shifting it right and starting with a new one after
32 iterations or using an list of 32 immediate operands for testing each bit
of the 32 bit register.
BT does have some advantage, you don't need to load a new destination each
32 bits. That's why it surprised me that you can test any bit within 256
megabytes of a memory location with BT.
TEST is denoted as having a latency of 0.5, the same as AND, where BT is
denoted as having a latency of 3.
With BT, the carry flag is set if the bit was set, with TEST, the zero flag
is set where the bit test TEST operand is a power of two, i.e. single set
bit. Thus, with Jcc, conditional JMP, the conditional jump instructions for
BT are JC orJNC for the carry flag being set or not, and JZ and JNZ for the
zero flag being set or not.
In my novice designs here, I was thinking about using BT because I saw a way
to get an offset out of the carry flag with ADC, add with carry. The notion
with that is avoiding a branch, because in compressed data, for example,
where bit scanning is required to determine the meaning of the next bit, as
opposed to reading some fixed or varying number of bits, the bits are
basically random, and branch prediction is futile with random data.
Given the available specification, it seems that using TEST is faster than
using BT for each of contiguous sequences of the bits. If you need to
access sparse offsets within the bit stream, then BT may well be faster.
If you have to address a larger amont of data than 256 megabytes, which is
unlikely, then using the TEST instead of BT method leads to simply 32*4 =
128 times as large an addressable area.
I try to think of some other way basically about removing a branch in favor
of an offset calculation. If you use AND instead of TEST, then to get a
value that is either 1 for a set bit or 0 for a cleared bit, the destination
would have to be shifted or rotated. Having duplicate offsets at each of
0x80000000m, 0x40000000 etcetera in process memory offsets would probably be
wrong, other things are loaded at those offsets and most regular memory
allocation schemes do not have a suggested offset.
Because the carry flag and zero flag are stored in the EFLAGS register,
which is "not" directly accessible via some instruction, and the ADC
instruction is slow, probably because it has to use a bit test to get the
Thus, there is a conditional branch on the bit being either zero, jz, or
one, jnz, with using the TEST instruction, which conveniently does not
modify the source operand, just as BT does not modify the source operand,
only the EFLAGS. It's convienient that it doesn't modify the source operand
because the source operand is reused to test the next bit.
I consider something like this, for testing bit 31 and avoiding a branch:
mov ecx, edx ; edx contains current word, ecx is scratch
and ecx, 0x80000000 ; clear low bits of ecx
shr ecx, 30 ; now ecx contains zero if bit 31 clear, 4 if bit 31 set
mov [output], [table+ecx] ; with no branch, one of two values is copied to
If that were possibly faster than simply using a branch, then it would have
to be where there are immediate operands for each of the bitmasks and each
of the shifts and variously shl or nop for the third instruction, to get 0
or 4 into ecx without using a conditional branch, Jcc.
That's because branch misprediction can cause an 8-14 cycle stall, or
something bad, and abunch of random ones and zeroes could trip the branch
predictor. Correct branch prediction can occur in almost zero time.
That's about avoiding a conditional branch on random inputs.
Herman Dullink wrote:
That is what I would like to know: the fastest way to access the bits on
each of the commodity CPU's by any means necessary.
In this case, an unroll factor of 8 will *always* align you to a byte
boundary (8*x is always a multiple of 8), and it makes the algorithm much
simpler and faster. Out of all of the widths <= 8, the only ones that won't
evenly divide into a byte are 3, 5, and 7. The latter can be optimized
fairly easily since only 1 bit need be shifted in/out for each 7 bits read.
The other 2 are oddball cases that can be optimized for but are a little
> It gets even more complicated when the amount of bits to read from two
> bytes varies. For example, you might need to read five bits, and then
> six bits, and then five bits, which at least starts over each two bytes
> (16 bits), but it might be necessary to read three bits, and then from
> those three bits detemine how many bits the next item contains, and so
As long as it is a compile-time constant as it generally is, this is still
not a big deal. Logic similar to the above will suffice because the compiler
can predict the sizes. Inevitably you will fall into repitition and that can
be exploited to make things align on a byte boundary.
If the bitfield widths are determined at runtime then, well, you're screwed.
I would guess that shld/shrd are the fastest way to extract them, and these
instructions are pretty slow.
> Then, where there are fixed length records of three bits instead of two,
> then instead of that loop that has about five lines, it takes upwards of
> forty or fifty lines to as efficiently as possible from C process in
> order those fixed length records.
char *data; // set this somewhere
size_t n_chars; // this too
for(size_t i = 0; i < (n_chars / 3); i++)
long j = *((long *) data);
data += 3; // making non-portable assumption here
process_bits((j & 0x000007) >> 0);
process_bits((j & 0x000038) >> 3);
process_bits((j & 0x0001C0) >> 6);
process_bits((j & 0x000E00) >> 9);
process_bits((j & 0x007000) >> 12);
process_bits((j & 0x038000) >> 15);
process_bits((j & 0x1C0000) >> 18);
process_bits((j & 0xE00000) >> 21);
This is possibly inefficient space-wise, but it is not inefficient
computationally. IAF that is not 40-50 lines of C code.
> There are a lot of ways that bitstreams are used to encode data. For
> example, if an image has a given number of bits per sample, then often
> the data is packed thus that the data takes less room than if it were
> simply using a few bits of each word. Then, to modify that data for
> most purposes, it needs to be either unpacked or specially modified as
> packed bytes.
These (from my experience, anyway) tend to stick to byte or multi-byte
boundaries in order to simplify, and the bit indexes are predetermined.
Macromedia Flash is about the worst case I can think of: they routinely
compress structures by specifying the number of bits required for certain
members of the structure. For example, a rectangle will have a fixed 5-bit
field determining the length of the integers used to represent the
coordinates. It still pads out to a byte boundary of some sort, though I do
not recall now the specifics of it. This sort of thing would require lots of
additional overhead or specially coded routines in a switch-case.
> That depends on that: the bit sequence width is known ahead of time,
> and that the offset of the beginning of the bit sequence in the register
> and the width of the bit sequence is less than the width of the
> register. If the bit sequence straddles registers, then multiple
> registers need to be considered, and it gets messy.
This limits it to 2 registers unless the bitfield's size is larger than the
register's size. I've never heard of such a case, though I've no doubt
someone has run into it somewhere. It's just not very common.
> [...] IAF that is not 40-50 lines of C code.
What does "IAF" mean?
| Primarily, I didn't know BT, Bit Test, had any meaning for values of the
| source operand greater than 32.
Works only in the register form BT[mem],r32 not with BT[mem],i32.
[big endian ...]
| To scan in the "natural" order, or left to right,...
It's not 'natural' at all, it's just 'humans familiarity' :)
Are the bits in a byte not always little endian ordered ?
If so, then you got a 'mixed endianess' and you can only work with
byte-bounds and you have calculate the offsets first
(bit-offset='n mod 8'='and n,7', byte offset='n/8'='shr n,3').
To work on opposite ordered (true big-endian) data with the little
endian x86, just start at the end of the line (the top of) and negate
[TEST vs. BT]
Yes, speed depends on the bit-string size and desired result format.
RCL eax,1 ;or any shift value
BTW: why expand a bit-image to true/false byte/dw-values ?
You may also have a look to the PAND,PNAND,POR,PNOR,PXOR instructions,
perhaps they will easier do what you like to perform.
Yes, you're right about not needing 40 or 50 lines of code. I was thinking of
some code that has to deal with image tiles where the edge tiles are a
In dealing with fixed length bitstrings where the length does not divide into
8, 16, 32, etc., then for length L you need to use the integral number of bytes
N thus that N *8 mod L = 0. In your example for 3 bits, you use 3 bytes for 24
bits, where 3 divides evenly into 24. For 5 bits, and this is rather obvious,
you need 5 bytes worth of input before restarting with a fixed length bit
sequence (bitstring, bit string) at the beginning of a byte, and for 7 bit wide
codes, 7 bytes.
In the case of three bits per code, in the high level language it might appear
process( bytes[i] >> 5);
process ( bytes[i] & 0x1C >> 2)
process ( (bytes[i] & 0x03 << 1) | (bytes[i+1] & 0x80 >> 7) );
process ( bytes[i+1] & 0x70 >> 4);
process (bytes[i+1] & 0x0E >> 1);
process ( (bytes[i+1] & 0x01 << 2) | (bytes[i+2] & 0xC0 >> 6) );
process ( bytes[i+2] & 0x38 >> 3);
process ( bytes[i+2] & 0x07);
The only time I ever see packed fixed width 3 bit codes is in 3 bits per sample
images, which are totally rare compared to even 4 bits per sample images for 16
color grayscale, or 2 bits per sample, or 1 bit or 8 bits per sample images.
So generally with the 3 bit per sample images, you want to use each 3 bits to
either a: get a color from a palette, or b: left shift each 3 bits in its own
byte to form an 8 bit value. Now, for the PNG style deal, you want to fill
the bits as you shift with a copy of those three bits to better represent the
saturation near the high end of the value, but that's different.
In either of those cases, generally you put each bit sequence into its own
byte, then you can deal with an array of bytes instead of a packed array of 3
bit sequences within a somewhat smaller array of bytes. This is where modern
general purpose computers have a remarkable amount of core memory often
measured in the hundreds of megabytes, backed by virtual memory implemented by
high speed storage devices that ranges into the hundreds of gigabytes.
Alternately, each 3 bits is processed in place and the array is not unpacked.
So if trying to categorize the use of these fixed length codes, then one notion
is to process each fixed length code, in order, and another notion is to unpack
the sequences into an array of basically word widths values: 8, 16, or 32
bits, with variously left alignment, right alignment, or modified left
alignment, that is, the 3 bits go in the most significant bits, the least
significant bits, or the most significant bits with copies filling the less
Now for the lower level language, assembler, basically the lowest level
language, discounting microcode, for x86 there are 32 bit registers. Thus, if
the consideration is to load input data on the register and read it before
discarding that data, then for 3 bit packed data it will take 3 registers worth
of data before starting over.
mov eax, [input];
; process each 3 bits of eax
mov ebx, [input+4]
; process each 3 bits of ebx and previous bits of eax
mov eax, [input+8]
; process each 3 bits of eax and previous bit of ebx
Here, the idea is that while it takes three words of input to get an integral
number of 3 bit sequences, you only need to have on registers two at a time,
because the fixed bitstring width is less than the friggin word width. If it
were one bit greater, then you'd still only need two registers, but if it were
two bits greater then three registers would be needed, because of the case
where the the msb of the bitstring is the lsb of a register.
So anyways you might have eax filled with 32 bits from input. The x86 is
little-endian, it loads the LSB (capitals for byte, lower case for bit), it
loads the LSB, bits 7-0, of EAX with the byte at [input], bits 15-8 with the
byte at [input+1], and bits 23-16 with the byte at [input+2], and bits 31-24
with the memory contents at address [input+3].
Now, when you want to process the contents of that register, to process it in
the same order as it is laid out in the data file, you need first process the
low byte, ie AL, then the next byte, AH, then process the byte comprising bits
23-16 of EAX, then the byte comprising bits 31-24 of EAX.
Some images are stored 11 bits per sample. Somewhere, there is a device
putting out data packed in pretty much any number of bits per sample less than
about 64. There is no theoretical limit, in some theories.
So anyways you can use a byte at a time instead of a register at a time. That
makes some things easier, but working with a register at a time gives four
times as much data, where the register mov instructions should be about as fast
as a byte mov instruction, neglecting instruction encoding size. So anyways,
with those 32 bits on a register, then there are the problems that, for
example, one of the 3 bit codes is: bit 1, bit 0, bit 8, where the first two
packed 3 bit sequences at [input] are bits 7,6,5 and bits 4,3,2. If you use
the bswap instruction on eax after loading eax, then the bit sequences are 31,
20, 29, and then 28, 27, 26, ..., to 1, 0, where then the next bit of that
sequence is bit 31 of the big-endianized ebx.
Another problem with using a register at a time is that the data buffer might
not be an integral number of 32-bit words, it is probably an integral number of
8-bit words with extraneous padding bits, and the buffer may hav enot been
allocated to be on a 32-bit boundary, which would make it folly in terms of
performance to load the register with 32 bits that is not on a 32-bit boundary,
as well, the array may be not allocated to an integral number of 32 bits,
leading to extra checking to not overread the buffer.
There are some arguable advantages of little-endian organization, except that a
contiguous list of bytes is big-endian, in essentially representing a single
word that is the entire input buffer.
There are some cases where the data is stored in lsb-to-msb form, that is to
say, the least-significant-bit to most-significant-bit, for example facsimile
and perhaps Deflate. What that means is that you can load a register full of
data stored in that way, and then simply increment instead of decrement the bit
offset for proceeding bits, ie, on the register the 3 bit sequences are bits 0,
1, 2, then 3, 4, 5, e.t.c., right to left.
So with the notion of scanning and variously processing fixed width packed
data, there are these kinds of variants:
There are lots of issues, and yeah, that's just for fixed length codes. That's
one reason why most uncompressed data formats are either bytes or specialized
hardware non-interchange data, with static fixed length codes.
So if you load and process bytes at a time, the rest of the register is just
sitting there idle, if you load registers at a time, the loop body is
There are 8 32-bit registers. This means we need to load only 1 register,
and then we throw away 1 byte. This isn't optimally efficient, but it's
pretty reasonable for bitfield operations.
> Here, the idea is that while it takes three words of input to get an
> number of 3 bit sequences, you only need to have on registers two at a
> because the fixed bitstring width is less than the friggin word width. If
> were one bit greater, then you'd still only need two registers, but if it
> two bits greater then three registers would be needed, because of the case
> where the the msb of the bitstring is the lsb of a register.
It takes 3 bytes of input. Since a modern x86 word is 32-bits, that's only 1
word of input, but 3 out of 4 times it involves unaligned cache accesses.
For 3 bits, you need only 1 register to hold input and 1 to hold the
unpacked bitfield. For 5-7 bits, you need 2 registers + 1 for output.
If the code is performance-intensive, it would be fair to unroll even more
to align to a word boundary and avoid all unaligned memory references.
> So anyways you can use a byte at a time instead of a register at a time.
> makes some things easier, but working with a register at a time gives
> times as much data, where the register mov instructions should be about as
> as a byte mov instruction, neglecting instruction encoding size. So
> with those 32 bits on a register, then there are the problems that, for
> example, one of the 3 bit codes is: bit 1, bit 0, bit 8, where the first
> packed 3 bit sequences at [input] are bits 7,6,5 and bits 4,3,2. If you
> the bswap instruction on eax after loading eax, then the bit sequences are
> 20, 29, and then 28, 27, 26, ..., to 1, 0, where then the next bit of that
> sequence is bit 31 of the big-endianized ebx.
There is no difference in encoding between word and byte instruction unless
there is an immediate.
This is a fairly bizarre scenario. I don't even know of a CPU that can
efficiently swizzle bits in such an arbitrary fashion. For that reason this
is extremely uncommon. The best solution I can think of is quite nasty and
would average 1 bit/2 cycles: test the bit you want to copy to get it into
CF and then use a shift or rotate with carry to accumulate the bits in a
> Another problem with using a register at a time is that the data buffer
> not be an integral number of 32-bit words, it is probably an integral
> number of
> 8-bit words with extraneous padding bits, and the buffer may hav enot been
> allocated to be on a 32-bit boundary, which would make it folly in terms
> performance to load the register with 32 bits that is not on a 32-bit
> as well, the array may be not allocated to an integral number of 32 bits,
> leading to extra checking to not overread the buffer.
Any standard allocator like malloc() will return buffers that are
word-aligned or better. Stack allocations are also word-aligned or better
depending on the necessity of the type. If the base of the buffer is
aligned, then you are guaranteed to not cause an exception by reading a full
word if the tail is only a partial word. This makes the final case much
simpler, though you still need to keep padding in mind.
> There are some arguable advantages of little-endian organization, except
> that a
> contiguous list of bytes is big-endian, in essentially representing a
> word that is the entire input buffer.
I don't follow at all.
> So with the notion of scanning and variously processing fixed width packed
> data, there are these kinds of variants:
> bitstring width
> word width
> bit order
> sequence count
> buffer size
> buffer alignment
> processing requirement
> and more!
Buffer size/alignment can be eliminated. Arguably this is what you want for
> There are lots of issues, and yeah, that's just for fixed length codes.
> one reason why most uncompressed data formats are either bytes or
> hardware non-interchange data, with static fixed length codes.
I think this is more due to the fact that bit swizzling is rather
time-consuming. Formats have evolved in a way that is fundamentally
compatible with the way CPUs are designed.
> So if you load and process bytes at a time, the rest of the register is
> sitting there idle, if you load registers at a time, the loop body is
It is not particularly more convoluted to load a full word at a time. You
replicate the logic on a single register instead of managing 4 byte
registers. It is actually more convenient.
One point is about the "memory buffer of bytes is a single big-endian
integer." The byte at the lower memory address is the more significant byte.
So say you have four bytes in a row at [x], [x+1], [x+2], [x+3] to process, if
you're processing the bits left to right then on the little-endian x86 then you
want to process [x] first but when you read four bytes into a register with a
single mov instruction, the contents of byte [x] have been placed at bits 7-0,
[x+1] at 15-8, [x+2] at 23-16, and [x+3] at bits 31-24. On a big-endian
processor, or processor in big-endian mode, where for example the PowerPC and
Itanium can operate in big- or little-endian mode, in big-endian mode on a 32
bit register [x] is at 31-24, [x+1] at bits 23-16, [x+2] at bits 15-8, and
[x+3] at bits 7-0. If you mov'd the contents starting at memory address [x]
onto a 64-bit register, then on an LE architecture [x] would be at bits 7-0, on
a BE architecture [x]'s contents would be on bits 63-56.
[x+3] 31-24 7-0
[x+2] 23-16 15-8
[x+1] 15-8 23-16
[x ] 7-0 31-24
To do with that is the consideration of the bytes that are filled in lsb-to-msb
but order instead of msb-to-lsb. That means instead of packing the most
significant bit of the data into the most significant bit of the byte or word,
that it goes into the least significant bit. That is to say, if the data is
binary 11001, then it is filled into a byte lsb first, as opposed to msb first
7 6 5 4 3 2 1 0 <- bit index
x x x 1 0 0 1 1 <-fill order lsb first
1 1 0 0 1 x x x <- fill order msb first, normal
Then, if a register is loaded with data from [x] on a little-endian processor
where the data was filled lsb-first, then to address the bits as they were
filled in order, which is the goal, then as above for [x] bits 7-0 contain the
first eight bits, but they contain them in the order 0-7. Then you can
proceeed in processing the word by plainly incrementing the bit offset.
Indeed, it may be feasible to use the BT instruction with plainly incrementing
the bit index, without the need to count in reverse order for each eight bits.
Another point is about "allocation" of the buffer on a word (register size)
boundary. If you call malloc or HeapAlloc the pointer to that memory is
probably evenly divisible by four, where as the word width in bytes is four,
that means that the beginning of the buffer is allocated on a word boundary.
Now, a problem might be that the data in question to process is at an offset
within that boundary, if for example raster data to unpack is at byte offset 3
or 333 or some other offset not divisible by the word width. When that
happens, then to get the sub-buffer of data containing the data to be processed
by the optimized register access code aligned on a word boundary, then there
are a couple options. One is to design the function to work on bytes, because
everything trivially aligns. Another option is to define the function to
start at a given bit offset within the register and pass it the address of the
previous word boundary as the buffer to process. Another option is to copy the
sub-buffer into an aligned buffer. On some architectures, it's better to copy
a misaligned buffer to an aligned buffer and then process with aligned access
than to use misaligned access for each element of the buffer. With many
situations there is enough control of the data in its specified form thus that
the sub-buffer is aligne don a word buffer due to the design of the data format
or the data parser implementation, but with other free-form compound
documents, for example, the data sub-buffer starts at an arbitrary byte offset.
It's important for the processor to have its data aligned. Unaligned word
access is like the telling the computer: build a fence over there, and then
move it a foot over.
So I want to try and figure out a good way to deal with the binary data,
starting with any byte and indeed any bit offset of a buffer, with fixed length
or static or dynamic variable length codes of less than half, more than half
and less than one, or more than one word size, filled in variously lsb or
msb-first order, organized in big- or little-endian words.
That is fairly well known amongst those who post here; and anyway, what
difference does it make? If you need to process "left-to-right" as you call
it then you can use bswap. Little endian makes more logical sense for such
operations as the word length makes no difference in how the data is
That is probably well-known to most of the readers of this group, but it's
barely well-known to me so I hope you don't mind if I go a litttle
pedagogical, because I'm trying to figure it out myself.
I'm in the process of implementing my little toy function for 1-D fax
decoding, and over the past week or so I've tried and tossed a couple
variations. Where I'm at is having totally rerolled the loop, and using test
reg/reg. There are still too many conditional jumps, especially one nasty
50/50 conditional jump smack-dab in the middle of the loop. i jump right to
the next instruction, not far off and back, the 50/50 just needs to call one
function. I feel that I need more registers because basically I'm short one
register. I have two variables that are used together, and only room on an
allocated register for one of them, as the other registers are allocated for
stuff that is used more often, of 7 usable registers and the stack pointer.
So I might try and put both of them on the stack and find a better use for a
register higher in the frequency table. Anyways that's about the notion of
naive register allocation.
One reason I have one less register is that I rerolled the loop and I'm not
using bswap. If I was able to process 32 bits at a time always, then I would
use bswap to more simply iterate through the bits on the input register, but I
have to stop and test each bitmask after testing a bit and then check if there
is any more input, because the input is in bytes and not words. So, because I
have to check each byte to see if that is the end of input because I have not
unrolled the loop, then I am shifting the bit test mask left 15 bits except
for when bit 24 was just tested and it's time for the next input word. You're
probably right, and I think you're right, about using bswap instead, and that
would surely be the case if I didn't have to have a conditional jump for each
I thought bswap was fast, it says it takes only one instruction cycle on the
Pentium Pro, but on the Pentium 4 it's supposedly bad. Basically I can put it
in this toy function, which otherwise will run on a 386. I have actually
about zero reason to run it on anything less than Pentium II, basically, but
if it does, that's great, and it doesn't look slower than using bswap, because
instead of bswap for each register I have three groups of an and a pair of
I figured out how to align and start on some sub-buffer unaligned address, I
assume that the buffer is aligned to some previous 32-bit boundary and just
use that, masking off the low two bits and starting with that address,
shifting the bit test mask with those low two bits, and then counting the
input byte-by-byte. So that way access is always aligned.
Anyways, I hope to get this toy function running here shortly, and have you
and other members of this group critique it. I appreciate your input.
Then, in the context of this thread, about the dang bits, I have some flickers
of insight. I want to figure out a macro set or generative framework to
embody some high-powered bit access patterns, and not even just for x86.
fixed length codes (eg 2, 3, 4, 8, 12, ...)
static variable length codes (eg 5-6-5)
dynamic variable length codes (eg Huffman, LZW, ...)
Then, there are as well less simple cases where for example static and dynamic
variable length codes are mixed, for example where a Huffman code is
immediately interpreted as a value and that determines the length of the next
bit sequence, say variable hybrid.
In considering the case of the fixed length codes, there are few complications.
One aspect of the data organization is the fill. Generically, the fixed length
codes are filled with their msb first into the output, msb first. There are
msb first msb first
lsb first msb first
msb first lsb first
lsb first lsb first
This is assuming that the output word width is the byte, and that the output
progresses from lower memory address to higher memory address, ie a huge
big-endian word. For example as above for a code with fixed length 5:
msb -> 11001 <- lsb
into a byte:
msb -> 76543210 <- lsb
Then in the order of the above list then it would be placed in a byte as
In some of the commodity (standardized, deployed) coding methods, the fill
order for code and word varies even within the coded data, for example filling
a Huffman code msb first into the output lsb first then following with a
machine integer filled lsb first into the output again lsb first, as is used in
The machine integer for its use has its lsb towards the lsb of the word, and
its msb towards the msb of the word. Otherwise it needs to be reversed, for
example for operations such as addition, or byte-swapped if the bytes are not
in the correct order, for example after reading a 32 bit big-endian (BE) word
on a little-endian (LE) machine, or vice-versa.
Now basically in terms of fixed length codes, the idea to make them tractable
for individual access is to copy the contents into a byte or other addressable
word. For example, in the case of 3 bit fixed lengths codes, FLC's, versus
SVLC's, DVLC's, and VHLC's, when they are packed into the bitstream it is often
not as convenient to work with them as when they are unpacked where each code
goes into its own byte, or for example with 12 bit codes where each code goes
into its own 16 bit word.
FLC: fixed length code
SVLC: static variable length code
DVLC: dynamic variable length code
VHLC: variable-hybrid length code
As we've covered there is a simple formula to determine how many words it takes
to contain an integral number of packed FLC's, where the word width is always
a power of two. Basically if the fixed length is not divisible by two, then
the number of required words is the length. If the fixed length is a power of
two 2^n, then the number of required words of length 2^m is ceil(n/m). If the
fixed length is divisible by two: x 2^n, then the number of words required to
contain an integral number of packed FLC's is some function of x, n, and m, eg
for a 12-bit FLC and an 8-bit word, three: x ceil(n/m).
There are a limited number of registers available for use, although the number
expands when a general-purpose register is not required and the processor has
expansion registers, either aliasing the floating point mantissae as is done
with MMX, or having extra registers as is the case with the XMM registers, on
x86 processors. Those registers don't support all the operations that can be
used with general purpose registers, for example the MMX registers do not have
the EFLAGS register set for conditional branching, but they do support some of
the logical operations on various levels of the processor, where Pentium II and
Pentium iii do introduce new operations on the MMX registers. I digress, where
the idea here is to determine some minimal operations that unpack FLC's into
sufficient word-width for the code, or in the obverse pack FLC's into a bit
As we've been discussing, on the LE architecture of the x86, where the lower
memory address contains the less significant byte, operating on register width
words requires either a convoluted bit offset for the generic case of msb/msb
filled data, if the input is to be read a register at a time. It is
particularly key to have aligned accesses of the register size data, because on
other architectures to be considered, an unaligned access isn't just the source
of a performance penalty, but also causes an error record to be generated,
leading to a crash of the program, for example unaligned access on the SPARC.
On the SPARC as it is a big-endian architecture, and not being x86 even not
part of the expressed charter, its big-endian organization leading to natural
order of msb/msb packed data is irrelevant.
We've put forth some high level descriptions of unpacking the FLC data into
"power of two" word width for its use as arguments to functions that expect
unpacked data or for algorithms that operate on unpacked data, yet there still
remains much to be considered in light of various eccentricities along the
lines of: scanline alignment, where basically that is resolved by treating the
input as an array of packed input arrays, thus about handling the edge case of
ignored bits at the end of the packed sequence. That leads to the
consideration that packed data may not even begin at a bit offset that is an
integral multiple of eight, not to mention the beginning of the array that is
not at a bit offset that is an integral multiple ofthirty-two.
When the fixed length code is of length that is a multiple of eight, there is
nothing to be done. For a fixed length code that is a multiple of two, that
has to do with every word width being a power of two. For the other cases of
FLC sizes, there are even other concerns.
FLC code sizes:
8, 16, 32, ... <- word width sizes
1 <- 1 bit
2, 4 <- code size evenly divides into word width
12 <- code size evenly divides into less bytes than code size, eg, 2 is a
3, 5, 7 <- code size only evenly divides into code size many words
7, 13, 29 <- code size is more than half of word width and shares no common
11, 17, 37, ... <- code size is greater than word width and shares no common
Thus, the implementation space rapidly explodes when considering arbitrary code
The static variable length code case basically reduces to the case of the fixed
length code that is the sum of the variable-length codes in a group. The
dynamic and hybrid variable length cases often become very specific, but again
by categorizing then broad assumptions about optimal implementations can be
Luckily, most data is stored in bytes, the 52 upper- and lower-case letters of
the Latin alphabet are encoded into bytes.
"In any case"
So, in reading the fixed length bit sequences, there are a variety of
Matt, Tim, Wolfgang, etcetera, comp.lang.asm.x86 for Intel processor x86
architecture assembly language coding, this writing is probably obvious
in several ways to you.
As soon as the bit sequences are being read (scanned) out of the
bitstream to form tokens, or symbols, then a question arises as to what
to do with the data. One notion is to scan the entire bitstream and
place each code into a word width or otherwise addressable memory
section, where it is then easy for other functions to operate on the
literal values, as machine integers or sample values, or within their
context, for example as glyph map entries.
When the fixed length code is not 8 bits, in msb/msb fill order, there
are requirements to process the bitstream to get those codes out of the
bitstream where they can be handled. In the case of the 3-bit codes,
for example, 3 bytes need to be processed at a time within one
procedure, or else a procedure has to be called one time for each code.
The use cases are all over the place, here are some considerations:
available memory for storing the "aligned" symbols
parallel processing cability and requirement, in the sense of other
registers processing scanned elements basically concurrently with the
few general purpose registers
parallel processing in the sense of multithreaded processing and
alignment of the input source to byte and word boundaries
Now basically my projects are not sophisticated enough to consider
multiprocessing, and there is a huge amount of memory on most
contemporary systems, even those from five or more years ago. So for a
fixed length code the only consideration is to scan all of them and put
each in some addressable memory location for later processing. It
_could_ be that a function would be better off scanning only a few
tokens at a time and then processing them, with regards to processing
that fits within the code cache along with the scanner, in keeping the
symbol in cache while processing, but still multiple symbols should be
processed at a time to keep the input on a register while it is being
processed. Another notion is that the input count needs to be tracked,
and that two procedures each operating on blocks, one on the input block
and the other on a temporary block of some manageable amount of symbols,
that each would have to track the length of input left, or to handle
some end-of-data symbol. So again it seems preferable to scan all of
the input tokens into the intermediate block, and then process each of
them. That might be not subtle nor sophisticated, but the idea is to
process them most efficiently.
So in terms of an assembler implementation as we have discussed, then
there is the consideration of the input being a contiguous bitstream,
with a bunch of codes written msb/msb. (msb of the symbol starting at
the free msb of the storage.) In the case of x86, the processor handles
integers in the LE, little-endian, byte order, and in general unaligned
accesses are bad, less instructions are better than more instructions,
simple instructions are better that more complex instructions,
conditional jumps are bad if reasonably unnecessary, cache misses are
bad, address complication is bad, not being a leaf function is bad:
each bad in the sense of detracting from some maximum performance.
The processor, a 32-bit processor, has general purpose registers with 32
bits each, the g.p. registers are 32 bits wide. When the processor
loads a value to fill a register, it is little-endian, so it loads the
byte (8 bit) value at the lowest memory address into the low byte of the
register, and the next byte at the next word to the left on the
register, and so on and so forth for the third and fourth eight bit
sections of the 32 bit value.
In a way that can be visualized as so. Say you had a list of digits,
addresssed from 1 to 1000, and you could store any digit from 0 to 9 at
each address. The little-endian method would store a 4 digit number
with the last, or least-significant digit, at the lowest memory
address. So when you store the number ABCD, (eg 5678, five thousand six
hundred seventy-eight), then it would be stored in the increasing memory
addresses with the places in order from the little-side, or the 1's
Then, when you read from address 1, it loads D, then C, then B, then A,
and places them on the register with D in the 1's place, or least
significant byte, then C to the left, then B, then A.
A B C D
Now that is obvious and well-known to experienced programmers, except
maybe about "loading", but it might take a while for those still
learning about it. Anyways, the fixed length codes are stored in those
bytes, with the first code in order being at the lower memory address,
and then variously in higher memory addresses.
So, when you read those four bytes onto a register, then on the register
D C B A
That's not so bad. It just means that instead of going from D->C->B->A
to process A->B->C->D. The problem with it is that those symbols are
not at issue or they are ready to process. Instead, each bit of the
four bytes (32 bits) starting at address 1 have to be read for each byte
A, B, C, D.
The bits on the 32-bit register are numbered 31 to 0, from most
significant but to least significant bit. So, where the codes are
msb/msb, then the first code is within the bits of byte A, which has
been loaded onto bits 7 through 0. So, intead of just being able to
process the bits starting at bit 31 annd going directly to 0, instead
processing has to start at bit 7, go to 0, then jump back to bit 15, go
to 8, jump back to bit 23, then to 16, and finally jump back to bit 31
and through to 24 to be ready for the next four bytes.
One solution for that is to re-order the bytes on the register so that
it is back to ABCD, then it is easy to process bits 31 through 0. The
machine instruction to do that, bswap, has to still split out of the
register each of A, B, C, and D, and reorder them.
So then in reading the fixed length codes out of the bitstream, that
reordering has to take place on little-endian architectures if 32 bits
of the data stream are to be handled at a time. It is good to deal with
32 bits at a time instead of 8 bits at a time, because bigger is better,
except sometimes big is bad. It's better to process thirty-two bits at
a time instead of eight, because then you don't have to load four times,
It's key that the load be aligned. That means if loading _four_ bytes
onto the register, that the address that you're loading from should be
divisible by _four_. The above example about addresses is not good in
explaining this because it starts at one instead of zero. So, where the
memory addresses start at zero:
Then , if you load four bytes starting at zero, that's good, because 0 %
4 = 0, the low two bits of zero are cleared, four evenly divides into
If, though you tried to load four bytes starting at address 1, that's
bad, because. It will still work on the Pentium/x86, but not the
Itanium without other processing, or the SPARC.
It will still work, but it would take something like ten times as long.
On Solaris/SPARC the unaligned_memory_access trap or something would be
enregistered or whatever, and then the program would abort with the
eloquent message: "Bus Error", although I read you can catch that trap
using supervisor instructions, which is TOTALLY IRRELEVANT. Loading
four bytes onto the 32-bit register from a memory address divisible by
four is good.
So anyways that goes back to "alignment of input source to byte and word
boundaries." If the input is not aligned to a word boundary, then if a
register is to be loaded, you should load the register starting at the
previous word-aligned address, and then start processing at the bit that
is 1/4, 1/2, or 3/4 through those four bytes, and then process
accordingly as each register is loaded in an aligned way.
mov eax, [input]
So, 32 bits are loaded, and then those are to be broken into 3-bit
sequences and each 3-bit sequence is to be put into its own byte, with
8-5=3 extravagantly wasted bits.
The register named eax, one of seven 32-bit g.p registers available to
the procedure, along with ebx, ecx, edx, and then esi, edi, and ebp, now
contains 32 bits of input.
mov ebx, [input+4]
mov ecx, [input+8]
mov edx, eax
add [input], 12
shr edx, 5
and edx, 0x07
Then edx contains the bits 7, 6, 5 of A in its low byte, which is also
called dl. For each of eax-edx, back on the 8086 there was only a-d,
then on the 80286 there are 16-bit ax-dx with each being split into high
and low bytes, eg ah and al, and then ax-dx are the low 16 bits each of
eax-edx. So, edx's low byte, dl, contains the first 3 bit sequence
pushed all the way to the right.
mov dl, [output]
Now, at the memory location with the address "output", there is the 3
bit code in 8 bits, and in the 96 bits of eax, ebx, and ecx there remain
93 bits, or 31 more 3-bit codes, to process.
Now, something besides ecx should be used because cl is is the only
register, as opposed to immediate or fixed literal, operand allowed with
the shr (aka shift right, >>) or shl (shift left) instructions, and that
value is going to be calculated, unless the idea is to totally unroll