Added couple of test data sets, trying to run baseline numbers


Tatu Saloranta

Mar 26, 2011, 12:52:14 AM
to jvm-compress...@googlegroups.com
Ok; now that I have added the following codecs:

* LZF
* QuickLZ
* Gzip/deflate (JDK one, JCraft)
* Bzip2 (from commons codec)
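For reference, here is roughly what a round-trip through the JDK deflate codec looks like; this is just a minimal sketch using plain java.util.zip (the class and helper names are mine, not from the project):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateRoundTrip {
    // Compress with the JDK's Deflater, favoring speed over ratio.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress back with Inflater; inflate() may throw DataFormatException.
    static byte[] decompress(byte[] input) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] original = "some highly repetitive test data "
                .repeat(100).getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(original);
        byte[] unpacked = decompress(packed);
        System.out.println(packed.length < original.length);      // smaller?
        System.out.println(Arrays.equals(original, unpacked));    // round-trips?
    }
}
```

The codec wrappers in the benchmark presumably look similar, just with the buffer handling pulled out so allocation doesn't dominate the timings.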

it is important to have a variety of test data sets available. Luckily there
seems to be plenty to choose from, so I have included the following sets
under testdata/:

- Calgary and Canterbury corpora (reasonably well-established benchmarks)
- maximumcompression test set
- test files from http://quicklz.com/bench.html

and will try to get them run on my Mac mini, just to get a baseline. If
anyone else has time & interest to run these, that would also help.
I am planning to add the resulting HTML pages within GitHub, and link to
them from the project wiki, once I get admin access to the project, or
someone enables "gh-pages" (which is the easyish way to add stuff).

Beyond this, it'd be great to get more codecs; either native Java
ones, or, if we must, JNI-wrapper based ones. I don't want to include
anything that would require shelling out to an external process, but
JNI is sort of acceptable.

So far the results are interesting: many codecs are fast, and the range is
huge. The only really slow one (relatively speaking) is bzip2; I am
tempted to check whether anything can be done to improve it. I know the
algorithm is not designed for speed, but it seems like there might be
room for improvement.
On the other hand, I am pretty happy with the speed of the fastest codec,
LZF; its compression speed is particularly impressive. But even
JDK/gzip is plenty fast when used the right way (I believe it uses a
native zlib implementation underneath).
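By "the right way" I mainly mean giving the stream a decent buffer and warming up the JIT before timing anything. A minimal sketch of what that could look like (the sizes and helper names here are my own assumptions, not the project's harness):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipSpeedCheck {
    // Use an 8 KB internal buffer instead of GZIPOutputStream's
    // 512-byte default, and write the payload in one call.
    static byte[] gzip(byte[] input) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream(input.length / 2);
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes, 8192)) {
            gz.write(input);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "the quick brown fox jumps over the lazy dog "
                .repeat(2000).getBytes(StandardCharsets.UTF_8);
        // Warm up so the JIT has compiled the hot path before we time it.
        for (int i = 0; i < 50; i++) gzip(data);
        long start = System.nanoTime();
        byte[] packed = gzip(data);
        double mbPerSec = data.length / 1e6 / ((System.nanoTime() - start) / 1e9);
        System.out.println(packed.length < data.length); // compressed smaller
        System.out.println(mbPerSec > 0);                // got a throughput number
    }
}
```

A single timed iteration like this is still noisy, of course; the real numbers should come from many repetitions with the fastest (or median) run reported.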

-+ Tatu +-
