ok, nifty idea. in my own LZ decoder, these cases are currently less
efficient, as they involve falling back to byte-based copies rather
than 32-bit or 64-bit copies.
I tried using 128-bit SSE copies, but found that on average (for the
often short LZ matches) this costs more than using 32 or 64 bit GPR copies.
like the LZ4 decoders I had looked at, the copy operations may write a
few bytes past the end of the span.
> In my own optimization to the LZ4 reference code I use unaligned SSE
> vector stores to handle any repeated pattern of lengths from 1 to 16 bytes:
>
> I load 16 bytes starting from where the reference pattern starts, then I
> use a pair of table-driven permute operations to repeat the relevant
> bytes across a pair of SSE regs, before I use a store loop like this to
> fill out the target memory area:
>
>
> do {
>     _mm_storeu_si128((__m128i *)target,        sse_reg_1);
>     _mm_storeu_si128((__m128i *)(target + 16), sse_reg_2);
>     target += update_length;
> } while (target < block_end);
>
> update_length is of course the largest multiple of the pattern length
> that is less than or equal to 32, i.e.
>
> update_length = (32 / pattern_length) * pattern_length;
>
> except that I simply store that alongside the permutation tables.
>
ok, may have to try similar.
>> a slightly less naive loop, ex:
>> cs=src; cse=src+sz; ct=dst;
>> while(cs<cse)
>> *ct++=*cs++;
>>
>> tends to peak out at around 900 MB/s.
>> it is ~1.5 GB/s if copying 32 bits at a time (int32), and 2.0 GB/s if
>> using SSE operations (MOVDQU).
>>
>> I have a Xeon box where a similar MOVDQU copy pulls off 2.5GB/s, but
>> int32 copy is 1.0 GB/s.
>
> Yeah, unaligned 16-byte stores tend to vary significantly in speed
> between cpu models. Most handle all stores within a cache line with zero
> extra cost, crossing cache lines (which happens ~25% of the time) costs
> an extra cycle or three, while page crossings can be so expensive as to
> significantly impact the average speed.
yeah. for larger copies, they seem to be the fastest option, but tend to
suffer a bit for small irregular copies.
ex:
while(cs<cse)
    { *(u64 *)ct=*(u64 *)cs; ct+=8; cs+=8; }
may be faster than:
while(cs<cse)
{
    x0=_mm_loadu_si128((__m128i *)cs);
    _mm_storeu_si128((__m128i *)ct, x0);
    ct+=16; cs+=16;
}
though, cases where one has a match longer than 64 bytes or so (which
seems to be roughly the break-even point) are uncommon enough that more
time ends up being spent overall in the 'if()' branch deciding whether
to use an SSE copy, so it works out cheaper to just always use a GPR copy.
could depend a lot on the data being decoded though.
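for completeness, the GPR wildcopy written out in portable form might
look like this (a sketch; using memcpy of a fixed 8 bytes to dodge
strict-aliasing issues, which compilers turn into a single unaligned
load/store on x86):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* unconditional 8-byte wildcopy; assumes n > 0, may write up to 7
   bytes past dst+n (so the destination needs slack), and for
   overlapped LZ matches is only safe when the match offset is >= 8 */
static void wildcopy8(uint8_t *dst, const uint8_t *src, size_t n)
{
    uint8_t *end = dst + n;
    do {
        uint64_t v;
        memcpy(&v, src, 8);   /* compiles to one unaligned 64-bit load */
        memcpy(dst, &v, 8);   /* and one unaligned 64-bit store */
        src += 8;
        dst += 8;
    } while (dst < end);
}
```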
>>
>>
>> I have a VQ codec, as noted, pulling off 1.0 to 1.5 GB/s (255 to 384
>> megapixels/second) on the BGRx decode path, for a single-threaded
>> plain-C decoder.
>>
>> this is around 2x as fast as the H.264 HW decode on this PC, and 3x to
>> 4x as fast as XviD.
>>
>> though, like LZ4 vs Deflate, the VQ codec does have worse compression,
>> but it can passably do 720p30 at 8 Mbps and 1080p at 12 Mbps, so it
>> doesn't seem too horrible (while pretty high bitrates vs YouTube videos,
>> this is a bit lower than typical ATSC broadcast TV using MPEG-2).
>
> Nice!
yep.
this particular design (BT1H) was originally meant for my robot
projects, for use in streaming video encoding on ARM-based hardware
(700MHz ARM11), as an attempt to improve the quality/bitrate over its
predecessor at a hopefully modest speed cost (BT1G was a similar
design, but lacked an entropy coder, opting instead for a fixed-format
byte-oriented design).
however, it has done a bit better in general than expected, and is
currently doing better than my previous VQ codecs in many areas.
I have thus far mostly been using it for desktop capture, as the encoder
is currently fast enough on my K10 to encode 1080p30 at ~15% CPU load.
the current single-threaded encoder maxes the encode thread at around
45-50 fps though.
in tests it is fast enough to encode 480p30 on a Raspberry Pi (700MHz
ARM11), and (still single threaded) 720p30 on a Raspberry Pi 2 (900MHz
ARM Cortex A7 quad-core). on the RPi2, 1080p30 could be theoretically
done with a tandem encoder design.
its per-thread decode speed still doesn't reach an older codec (BT1C) in
certain modes, but it does match the old speeds in the mode I typically
used it in (essentially RPZA-like with 23-bit color endpoints and using
my BTLZH encoder as an entropy backend).
basically, in certain edge cases, my older codec could reach absurd
decode speeds (gigapixel/second territory for multi-threaded decoding).
however, these weren't really "actually usable" scenarios (for anything
more than a benchmark, things would be seriously IO-bound).
however, quality/bitrate in BT1H is significantly better than BT1C (1).
at the time, BTLZH (an extended Deflate variant) reached a speed of
around 250 MB/sec for decode, but I had more recently micro-optimized
BTLZH to 475 MB/sec. though, in pure Huffman tests, it is still a little
slower than ZLIBH and FSE-HUF (140MB/s vs 200MB/s).
BTLZH basically extends Deflate slightly (and remains backwards
compatible by design), getting BZ2 like compression but much faster
decoding, mostly by using a larger sliding window and borrowing a few
tricks from LZMA (though sticking with static Huffman, adding the
ability to predict match lengths and offsets from previous matches).
apparently the design competes in the same general space as Zstd.
both seem to have otherwise fairly comparable stats.
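the LZMA-style repeated-offset prediction can be sketched like this (the
two-slot history and the code values are illustrative of the general
trick, not BTLZH's actual bitstream encoding):

```c
#include <assert.h>
#include <stdint.h>

/* decoder-side state: most recent match offsets, most recent first */
typedef struct { uint32_t rep[2]; } lz_state;

/* map a small "offset code" to a real offset: codes 0 and 1 reuse a
   recent offset cheaply, anything else takes an explicitly coded one */
static uint32_t resolve_offset(lz_state *st, int code, uint32_t raw_offset)
{
    uint32_t off;
    if (code == 0) {
        off = st->rep[0];            /* reuse most recent offset */
    } else if (code == 1) {
        off = st->rep[1];            /* second most recent; move to front */
        st->rep[1] = st->rep[0];
        st->rep[0] = off;
    } else {
        off = raw_offset;            /* explicitly coded offset */
        st->rep[1] = st->rep[0];
        st->rep[0] = off;
    }
    return off;
}
```

the win is that repeated offsets are very common in structured data, and
a 1-2 bit code is much cheaper than re-coding a full offset.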
superficially, BT1C and BT1H are similar in that both use 4x4 pixel
blocks, and blocky VQ, with each block typically as 4x4x2b with 2-bits
encoding a value interpolated between a pair of endpoints.
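the basic 2-bit interpolated block decode can be sketched roughly like
this (DXT/RPZA-style; the index layout, rounding, and BGRx packing are
my own guesses for illustration, not the actual BT1C/BT1H formats):

```c
#include <assert.h>
#include <stdint.h>

/* build a 4-entry palette by interpolating between two RGB endpoints,
   then select per pixel with 2-bit indices (16 pixels -> 32 index bits) */
static void decode_block_4x4x2(uint32_t indices, const uint8_t ep0[3],
                               const uint8_t ep1[3], uint32_t out[16])
{
    uint32_t pal[4];
    for (int i = 0; i < 4; i++) {
        int w = i * 255 / 3;                         /* 0, 85, 170, 255 */
        uint8_t r = (uint8_t)((ep0[0]*(255-w) + ep1[0]*w + 127) / 255);
        uint8_t g = (uint8_t)((ep0[1]*(255-w) + ep1[1]*w + 127) / 255);
        uint8_t b = (uint8_t)((ep0[2]*(255-w) + ep1[2]*w + 127) / 255);
        pal[i] = (uint32_t)b | ((uint32_t)g << 8) |
                 ((uint32_t)r << 16) | 0xFF000000u;  /* pack as BGRx */
    }
    for (int p = 0; p < 16; p++)                     /* 2 bits per pixel */
        out[p] = pal[(indices >> (2 * p)) & 3];
}
```

this is also where the "output pixels are a series of 32-bit integer
stores" speed comes from: the per-pixel work is just a table lookup.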
a bit of a difference though comes up in terms of how color endpoints
are handled:
BT1C primarily used a pair of 23-bit RGB endpoints (R8G8B7);
BT1H uses YUVD or YUVDyuv, or essentially a vector in YUV space
(essentially the JPEG YCbCr colorspace).
YUVD could encode a pair of endpoints differing in luma, whereas YUVDyuv
encodes a pair of endpoints in luma/chroma space.
BT1C used BTLZH as the entropy backend, whereas BT1H uses AdRice and a
raw bitstream.
BT1H also allows more readily using a number of different types of
blocks, ex:
flat (no bits);
2x2x1 (4 bits);
2x2x2 (8 bits);
4x2x2 (16 bits);
2x4x2 (16 bits);
4x4x1 (16 bits);
4x4x2 (32 bits);
4x4x3 (48 bits);
4x4x2Y2x2x2UV (48 bits, YUV 4:2:0);
4x4x3Y2x2x2UV (64 bits, YUV 4:2:0);
4x4x2YUV (96 bits, YUV 4:4:4);
4x4x3Y4x4x2UV (112 bits, YUV 4:4:4).
the type of block is chosen per-block based on quality level and
heuristics.
the 4:2:0 and 4:4:4 blocks interpolate YUV from YUVDyuv endpoints for
individual pixels. these are used for blocks with high chroma contrast
(there was an issue with my past VQ codecs I called "the Rainbow Dash
issue", where sharp transitions in chroma, such as Rainbow Dash's mane,
look terrible without blocks that can handle higher chroma resolution).
however, for the majority of blocks, the contrast is a bit lower, and
cheaper blocks can be used (fewer bits and less resolution).
for most blocks, the YUV<->RGB conversion is per-block, not per-pixel.
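the YUV->RGB step itself, assuming the full-range JPEG-style YCbCr
mentioned earlier, can be sketched in 16.16 fixed point (coefficients
rounded by me; arithmetic right-shift of negatives assumed, as on
typical compilers):

```c
#include <assert.h>
#include <stdint.h>

static uint8_t clamp255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* full-range JPEG-style YCbCr -> RGB, 16.16 fixed point:
   R = Y + 1.402*(Cr-128)
   G = Y - 0.344*(Cb-128) - 0.714*(Cr-128)
   B = Y + 1.772*(Cb-128) */
static void ycbcr_to_rgb(int y, int cb, int cr,
                         uint8_t *r, uint8_t *g, uint8_t *b)
{
    int u = cb - 128, v = cr - 128;
    *r = clamp255(y + (( 91881 * v) >> 16));
    *g = clamp255(y - (( 22554 * u + 46802 * v) >> 16));
    *b = clamp255(y + ((116130 * u) >> 16));
}
```

doing this once per block rather than 16 times per block is a big part
of why the per-block conversion is the cheaper path.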
lower quality levels force the encoder to use more cheaper blocks, and
to use harsher quantization on the color-vector deltas.
there were a few block types which achieved YUV 420 and 444 via encoding
more color-points, but these aren't really used at this point as they
compress worse and are slower to encode/decode.
generally, the currently used blocks all use a single color-vector per
block; the higher-chroma blocks just interpolate the components
independently, as opposed to interpolating between a pair of points.
a few of the directly-coded 420 and 444 blocks could conceivably be used
for lossless encoding, but the current decoder is not capable of doing so.
>>
>> the YUY2 decode path is faster in an Mpix/sec sense (gets around 500
>> Mpix/sec), but lower in terms of raw bytes (YUY2 only uses 1/2 as many
>> bytes as 32-bit BGRx).
>>
>> though, a lot of this is mostly because the VQ decode process does a lot
>> less work per pixel vs a traditional transform-based codec, and also in
>> many cases, producing output pixels is a series of 32-bit integer stores.
>>
>>
>> another (sort of) trick is handling Adaptive Rice coding in a similar
>> manner to conventional static-Huffman coding, which allows both AdRice
>> and SMTF+AdRice to perform similarly to static Huffman, but AdRice is a
>> little cheaper overall as it has less overhead than Huffman.
>>
I am not sure if the current video codec could work out as-is if Huffman
coded. the original use-case it was designed for is mostly why Rice is
used.
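for reference, the bit-level logic of one adaptive Rice decode step
might look like this (a sketch; the adaptation rule is my guess at the
general idea, and the actual codec apparently handles it table-driven,
more like a static-Huffman decoder, rather than bit-at-a-time):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* minimal MSB-first bit reader */
typedef struct { const uint8_t *buf; size_t pos; } bitreader;

static int read_bit(bitreader *br)
{
    int b = (br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1;
    br->pos++;
    return b;
}

static unsigned read_bits(bitreader *br, int n)
{
    unsigned v = 0;
    while (n-- > 0) v = (v << 1) | read_bit(br);
    return v;
}

/* one adaptive Rice decode step: unary quotient, k-bit remainder,
   then nudge k based on the quotient length */
static unsigned adrice_decode(bitreader *br, int *k)
{
    unsigned q = 0;
    while (read_bit(br)) q++;             /* '1' bits, '0' terminator */
    unsigned r = read_bits(br, *k);
    unsigned val = (q << *k) | r;
    if (q == 0 && *k > 0) (*k)--;         /* symbol was small: shrink k */
    else if (q > 1 && *k < 24) (*k)++;    /* long quotient: grow k */
    return val;
}
```

since k only changes by at most 1 per symbol, the (symbol, length,
new-k) tuples can be precomputed per k into lookup tables, which is
roughly how it can be made to decode at static-Huffman-like speeds.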
I am also not sure Rice is currently the best option for block commands,
which make up a fair part of the bitstream at lower bitrates and seem
highly saturated.
it is possible low-bitrate operation could benefit from the use of an
optional secondary range-coder step (say, feeding Rice-coded bits
through a bitwise range coder), but I have reservations here.
luckily, at least AdRice should be a little easier to mesh with a
bitwise range-coder than Huffman was. TBD whether it is worth the effort
to implement and test this case.