On 9/28/2013 7:45 AM, Terje Mathisen wrote:
> Vince Weaver wrote:
>> On 2013-09-27, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>>>
>>> Have you looked at LZ4?
>>>
>>> I really like this algorithm due to the combination of an easy to
>>> explain/implement algorithm, pretty good compression ratios, and
>>> _really_ fast decompression.
>>
>> I haven't looked at it before.
>>
>> Hmm, someone has posted LZ4 decompression code for 8088 that fits in
>> 79 bytes.
>
> Is that the "OldSchool Rambler" or is his code even smaller?
>>
>> My 8086 LZSS code is 58 bytes, although in theory the difference could
>
> Wow!
>
yeah.
me idly remembering trying to fit useful amounts of stuff into a
512-byte bootsector.
some people were off loading a kernel from a filesystem, decompressing
it, setting up P-Mode, and jumping to the kernel, somehow all within a
single boot-loader.
I generally just used the strategy of loading the kernel and a
"second-stage" loader (simpler, less hairy), which would itself handle
any decompression, image loading, and setting up P-Mode (generally at
the time was using FAT12/16/32 and LZ + PE/COFF).
I could probably do it all a little better now, but alas...
it would be hard to beat just using an existing kernel (such as the
Linux kernel), and at the time this realization killed the effort
(like, getting it working, only to realize just how far behind it was
from being a "real" OS).
yep.
x86 instruction encoding strategy:
throw a bunch of magic prefix bytes at the problem.
had more of the 1-byte space been used for 2-byte opcodes, maybe there
would be fewer overly long instruction forms.
like, a little less:
660FXX/r and F30FXX/r and 660F38XX/r and similar.
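to put some byte counts on that, here is a rough sketch that tallies the encoded length of a few register-to-register forms. the specific instructions are my own illustrative picks (they are not named above), and `ENCODINGS` and `encoded_length` are just hypothetical helper names; the ModRM byte C1 selects xmm0, xmm1.

```python
# Encoded lengths of a few x86 instructions, shortest to longest.
# The instruction choices are illustrative assumptions, not taken from the post.
ENCODINGS = [
    ("movsb",             "A4"),              # plain 1-byte opcode
    ("rep movsb",         "F3 A4"),           # F3 acting as a REP prefix
    ("movaps xmm0, xmm1", "0F 28 C1"),        # 0F escape + opcode + ModRM
    ("movdqa xmm0, xmm1", "66 0F 6F C1"),     # 66 0F XX /r
    ("movdqu xmm0, xmm1", "F3 0F 6F C1"),     # F3 0F XX /r
    ("pshufb xmm0, xmm1", "66 0F 38 00 C1"),  # 66 0F 38 XX /r
]


def encoded_length(hex_bytes: str) -> int:
    """Number of bytes in a space-separated hex string."""
    return len(bytes.fromhex(hex_bytes))


for mnemonic, hex_bytes in ENCODINGS:
    print(f"{mnemonic:<20} {hex_bytes:<16} {encoded_length(hex_bytes)} bytes")
```

so a simple register move goes from 3 bytes (SSE1) to 4 or 5 bytes once the mandatory prefixes and the 0F 38 escape pile up, and that is before any memory operand or displacement.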
> Yes, they can process 8/16/32/64 items with a single instruction, but
> the processing loop will be significantly more complicated than a naive
> REP SCAS/REP MOVS version.
>
I once did a test, and found that past a certain size, on the hardware I
was using at the time (early 2000s), the method used to copy memory
didn't really make much of a difference WRT performance.
so, it was like a simple "REP MOVSB" loop, vs using SSE operations for
copies, vs running an LZ77 decompressor, all running at pretty much
*exactly* the same speed.
with a newer computer with DDR3, there doesn't really seem to be such an
obvious speed-ceiling though (so, the choice of copy operation matters a
little more).
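one plausible reading of why the LZ77 decoder kept pace with the plain copies: a byte-oriented LZ77 decoder is itself little more than a series of copies (literals copied from the input, matches copied from earlier output). a minimal sketch, assuming a made-up token format (tag 0 = literal run, tag 1 = match); this is not LZ4, not LZSS, and not the decoder I actually tested, and `lz_decode` is a hypothetical name:

```python
def lz_decode(src: bytes) -> bytes:
    """Decode a toy LZ77 stream (hypothetical format, for illustration only).

    Tokens:
      [0, count, <count raw bytes>]  literal run, copied from the input
      [1, distance, length]          match, copied from earlier output
    """
    out = bytearray()
    i = 0
    while i < len(src):
        tag = src[i]
        if tag == 0:
            count = src[i + 1]
            out += src[i + 2 : i + 2 + count]   # copy input -> output
            i += 2 + count
        elif tag == 1:
            distance, length = src[i + 1], src[i + 2]
            if distance == 0 or distance > len(out):
                raise ValueError("bad match distance")
            start = len(out) - distance
            # Byte at a time, so an overlapping match (length > distance)
            # repeats the pattern, as in classic LZ77.
            for k in range(length):
                out.append(out[start + k])      # copy output -> output
            i += 3
        else:
            raise ValueError("unknown tag")
    return bytes(out)


packed = bytes([0, 3]) + b"abc" + bytes([1, 3, 6]) + bytes([0, 1]) + b"!"
print(lz_decode(packed))  # b'abcabcabc!'
```

the per-token overhead is a few loads and a branch; everything else is moving bytes, so once memory bandwidth is the limit it is not too surprising that it lands near a straight copy.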
also, the CPU and RAM have gotten sufficiently faster relative to the
HDD that storing data in a compressed form and decompressing in RAM can
be faster (*1).
though, granted, in many cases, a lot of these gains can be offset by the
OS's disk-cache, where file-IO is pretty fast if the data happens to be
cached in RAM by the OS (but this is much less likely for large files).
*1: I suspect current high-res video playback is only really possible
due to this: uncompressed 1080p would require more bandwidth than is
available from a typical HDD.
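the arithmetic behind *1 can be sketched as below. the pixel format (24-bit RGB) and the frame rates are my assumptions, and `raw_video_rate` is a hypothetical helper name; plug in whatever sustained read rate you assume for the drive.

```python
def raw_video_rate(width: int, height: int, bytes_per_pixel: float, fps: int) -> float:
    """Bytes per second needed to stream uncompressed video."""
    return width * height * bytes_per_pixel * fps


# 1080p at 3 bytes per pixel (RGB24, an assumed format)
for fps in (24, 30, 60):
    rate = raw_video_rate(1920, 1080, 3, fps)
    print(f"1080p RGB24 @ {fps} fps: {rate / 1e6:.1f} MB/s")
```

this gives roughly 149 MB/s at 24 fps, 187 MB/s at 30 fps, and 373 MB/s at 60 fps. a 4:2:0 format at 1.5 bytes per pixel would halve those numbers, which is still a lot to ask of a single spinning disk for sustained reads.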
granted, there is a wide range of possibilities:
H.264: compresses well, but is a little expensive to decode;
H.263: compresses slightly worse, but is notably faster;
Theora: seems fairly similar to H.263 as far as I can tell;
...
RPZA: compresses pretty badly, but very fast decoders are possible (*2).
*2: for some use-cases, like animated textures (very short looping
animations), it seems pretty good, but not so great for video (in
general). for video in 1080p or 4k resolutions, disk IO is likely to
kill performance, probably making entropy coding necessary at minimum,
which in turn gives more of an advantage to other formats.
actually, disk-IO is a similar issue for trying to use DDS (trying to
load DDS files is somewhat IO bound). IME, most of the
size/quality/speed tradeoff regarding texture loading goes to a
customized JPEG variant (which also works pretty well for animated
textures, but still isn't really great for video).
Theora does pretty well for video, but isn't really a particularly great
option for animated textures. it is likely a pretty good option for
things like cutscenes or similar though.
or such...