I have developed a new compression library called QuickLZ which is
written in 80486 compatible assembler and is GPL licensed.
I have created benchmarks that include LZO, LZP, LZF and Zlib and it
seems that QuickLZ outperforms them :-)
http://quicklz.earthstorm.com/
Greetings,
Lasse Reinhold
Benchmarked here too.
http://cs.fit.edu/~mmahoney/compression/text.html
The demo fails on enwik9 (file too big). It is extremely fast on
enwik8, but I/O bound like several other fast programs on my PC. CPU
is about 2 ns/byte on a 2.2 GHz Athlon-64 for compression and
decompression, but 4x slower with disk I/O. For I/O bound programs,
better compression is faster because there is less compressed data to
read/write. The fastest decompression is CABARC, which compresses to
about half the size of QuickLZ.
-- Matt Mahoney
I repeated the enwik8 compression/decompression test each two times for
demo, lzop and thor at system see note 19
http://cs.fit.edu/~mmahoney/compression/text.html with 4 x SATA Seagate
750GB RAID5 (4 disks) Sandra Lite 2007.8.10.105 sequential read: 221
MB/s sequential write: 170 MB/s and used timer 3.01 process time =
kernel + user time.
demo 57,331,969 bytes
101.105704 Mbyte/s
1.609 sec
101.061142 Mbyte/s
1.578 sec
demo -d
143.413885 Mbyte/s
1.328 sec
141.705504 Mbyte/s
1.234 sec
lzop -1 53,437,564 bytes
1.421 sec
1.437 sec
lzop -d
0.593 sec
0.593 sec
thor ef 55,063,944 bytes
62.29 MB/s
1.531 sec
62.25 MB/s
1.531 sec
thor d
41.23 MB/sec
2.312 sec
41.25 MB/sec
2.312 sec
cabarc n 36,590,004 bytes
16.421 sec
16.406 sec
cabarc x
2.203 sec
2.187 sec
cabarc -m lzx:15 n 34.783.659 bytes
42:328 sec
42.328 sec
cabarc x
0,921 sec
0.875 sec
cabarc -m lzx:21 n 28.465.607 bytes
1:51:625 sec
1:52:359 sec
cabarc x
0,812 sec
0,828 sec
...if file compression/archiving is your only goal, yes. But there are
applications that use "real-time" decompression while other processes
are running, such as streaming assets for games. For such
applications, decompression complexity should be as low as possible
since the assets have to be streamed while the game is in progress.
> The fastest decompression is CABARC, which compresses to
> about half the size of QuickLZ.
If the OP wants to research what algorithm CAB uses, it's LZX:
http://en.wikipedia.org/wiki/LZX_%28algorithm%29
Hi, I took a look at the code and didn't see anything 486-specific
(BSWAP, etc.). The code should run fine on 386s. Did you mean to
write 486-compatible, or 386?
But that begs the question: What is the target platform for your
library? 386 and higher, or modern systems? I ask because your code
is 386-centric (uses CMPS, STOS, MOVS string instructions) but you post
benchmark times on very modern hardware. If your target is modern
hardware, there is a lot of assembly optimization you could do, such as
replacing string instructions with faster equivalents, or pipelining
the code into at least the U and V pipes.
Sounds very good. Could you please add THOR benchmarks to you site,
THOR is most of the time faster then LZO(P) :
Using MMX (for example) for matchlength comparing is faster but
increases grannularity from 4 to 8 bytes so that compression ratio
drops because matchlengths are usually very short. It was also found
that assembling bitfields of length 4...25 into the output stream using
plain 386 instructions is just as fast as using MMX registers as
buffer. I acturly timed 8 different constructions just for the
bitstream assembling part alone.
The same applies to other modern features like prefetching
instructions, branch hint prefixes, SSE2, etc - I've simply found
little or no use for them in the partcular problem, or their effects
were too marginal (2 Mbyte/s order) to trade backward compatiblity.
Porting to x64 would be interesting, though.
Now, for pairing. Because the code is tight, the order of the
instructions is almost locked (the span in which you can reorder an
instruction manually is far smaller than the lookahead of the
processor). Instead, I've tried to avoid unpredictable branches which
were timed to be very significant.
So, the implementation is pretty optimized even though only the plain
old instruction set of 386 is used. But feel free to experiment :-)
There's *far* more potential in optimizing the algorithm, though. Some
ideas that I will work on when I get time is to use byte sized
bitfields to avoid bitstream assembling and also to let a matchlength
pass beyond the current source pointer to save one bounds check.
I didn't know about THOR at release time but it looks interesting. I
have mailed the owner of Maximumcompression and he said that he will
benchmark QuickLZ at the next website update.
You can also look at Matt's and Sportman's timings in this thread
(thanks!).
I think you are talking to same person.
Fulcrum == Werner == Owner of MaximumCompression
Sachin Garg [India]
www.sachingarg.com | www.c10n.info
But you still haven't answered my question: What is the target
platform for your library? I ask this because, even if you don't want
to use MMX, etc., on Pentium it is faster for short matches to not use
rep movs and just move it manually. If you don't care about supporting
386/486, then I suggest you read some of the excellent Pentium
optimization guides on the 'net.
> There's *far* more potential in optimizing the algorithm, though. Some
> ideas that I will work on when I get time is to use byte sized
> bitfields to avoid bitstream assembling and also to let a matchlength
> pass beyond the current source pointer to save one bounds check.
Yeah, I noticed that. You have a ways to go, but so far it's very
promising. :-) (I can say this because I've also written x86
assembler decompression routines)
That's correct :)
The target platform is of course modern processors.
I havn't yet looked at the rep movsb in the decoder - I'll test if a
mov loop is faster soon.
I have, however, tested the cmpsb part of encoding thoroughly. One
construction was:
mov ebx, edi
sub ebx, esi
mov ecx, 68
next:
mov al, [esi]
xor al, [esi + ebx]
jnz diff
inc esi
loop next
diff:
This should move some work to an AGU by adding EBX and save an inc of
EDI. I also timed not using this trick. In both cases, I got around 56
Mbyte/s compared to 61 Mbyte/s using the current implementation, on a
Celeron M. There was the same order of difference on an Athlon64. The
test was on an XML file where LZ matches are dominating.
So, when to use SCAS and when to use other methods is a matter of
experimenting. In this particular problem, we have matchlengts with 50%
of all matchlengths being 0...4 or thereabout (3 bytes were already
matched before the loop) and matchlength testing must terminate at 68
bytes. With these particular constraints, REP SCASB suits best, at
least to my experience :-)
Again, if we for testing purposes remove the 68 byte limit on
matchlength (encoder will produce garbage), we get 67 Mbyte/s on the
Celeron, showing the importance of the algorithm.
> Yeah, I noticed that. You have a ways to go, but so far it's very
> promising. :-) (I can say this because I've also written x86
> assembler decompression routines)
Cool :-)
- Lasse Reinhold
Then there is a heck of a lot more you can do :-) Although I like that
it is 386 compatible ;-)
Maybe split the sources -- keep existing as 386-compatible, and
alternate sources for modern machines?
> I have, however, tested the cmpsb part of encoding thoroughly. One
> construction was:
>
> mov ebx, edi
> sub ebx, esi
> mov ecx, 68
> next:
> mov al, [esi]
> xor al, [esi + ebx]
> jnz diff
> inc esi
> loop next
> diff:
LOOP is slower than DEC CX; JNZ starting on 486 and higher, btw.
> Again, if we for testing purposes remove the 68 byte limit on
> matchlength (encoder will produce garbage), we get 67 Mbyte/s on the
> Celeron, showing the importance of the algorithm.
Speaking of which, at no point on the website or source do you have an
explanation of the algorithm. Slogging through source code to discover
the algorithm isn't fun; can you provide a brief description of the
process and output?
Yes, you'd have to split the sources because adding conditional
branches at the places where you want to use modern features slow down
the code on most processors, even if the branches are such that they
are *not* taken on the modern processors (letting execution continue at
the following instruction, not wasting any code cache). I don't know
why, there are plenty of TLBs for prediction :-/
This is also the reason why I havn't added different compression
levels; A single "never taken" or "always taken" branch that makes the
encoder use an 8 Kbyte or 2 Kbyte hash table slows down compression 2-3
Mbyte/s.
> Speaking of which, at no point on the website or source do you have an
> explanation of the algorithm. Slogging through source code to discover
> the algorithm isn't fun; can you provide a brief description of the
> process and output?
I've added a brief description of the algorithm at the end of the
manual: http://quicklz.earthstorm.com/manual.html
- Lasse Reinhold
> I've added a brief description of the algorithm at the end of the
> manual: http://quicklz.earthstorm.com/manual.html
>From the manual: "matchlength, both in form of bitfields of variable
length" -- what convention are you using to store variable lengths?
If you look in qlz_initialize() in the source, an offset can be encoded
in one of following four ways, depending on its value:
a) 00000000 00000000 0000c001
b) 00000000 000000cc cccc0011
c) 00000000 000ccccc ccccc101
d) 0000000c cccccccc ccccc111
The c's are the bits that contain the value and the value is biased, so
that c = 0 in case a means 3; cccccc = 0 in case b means 5; cccccccccc
= 0 in case c means 69, etc. RLE-lengths and matchlengths work in the
same way. So, it's a kind of static huffman.
I am, however, working on a new byte aligned encoder where offsets and
runlengths are combined and only following 5 bitfields are possible:
// oooool 00
// oooooooo oooool 01
// oooooooo oollll 10
// oooooooo oooooool lllll 011
// oooooooo oooooool llllllll llll 0111
Here, o are offset bits and l are matchlengt bits. This schema has in
my prototyping environment shown to give ~10% better text compression
and ~4% better binary comression. It also has potential for *extremely*
fast decoding.
- Lasse Reinhold
Yes, now you're talking. I await your next benchmarks where you
outperform LZO :-)
Added the results to the maximumcompression site, I will quote a piece
to illustrate why it did not make it to first place:
It scores very well on the MFC speed comparison, but it could do even
better, as actual compression/decompression takes place in (less then)
half the times listed. The program flushes is buffers entirely before
returning to the command prompt, loosing valuable seconds. BTW For MFC,
I tested a special version with the 100 Mb limit removed.
Faster how? Compression, decompression, or both?
Both
That's not my impression - thor is "very" slow at decompression. I've
tested an arbitrary swapfile.sys fragment (10 MB compresses into ~3.7
MB) and a database backend (3 MB compresses into ~400 KB). Both were
compressed and decompressed 10 times and the results are given in
seconds with the two files separated by semicolon.
The timings are done on a ramdrive:
Decompression:
lzo: 0,91 ; 0.25
thor: 3,09 ; 0.75
Compression:
thor: 1.72 ; 0,44
lzop 1.86 ; 0.35
I can't compare with quicklz using files because i have no proper demo
project for that yet (existing demo adds 0.20 - 0.38 seconds per run in
a GHz detection loop, doesn't use disk write buffering, etc). I'll fix
it in the next byte aligned version.
- Lasse Reinhold
Sorry, you are right, I somehow misread the question. I assumed he was
asking for the impact of removing the 'full buffering' on the
(de)compression speed of QuickLZ. My mistake.
Greetings,
Lasse Reinhold