info: using lzma compression

Grant

Jul 22, 2008, 2:06:40 AM
Hi there,

After seeing filename.tar.lzma tarballs for dnsmasq recently I looked
into using lzma here, with slackware-11.0.

Firstly, why bother? Well, on a 2.3MB datafile lzma compresses a lot
better than bzip2; the lzma tarball is under half the size of the bzip2 one:

-rw-r--r-- 1 grant wheel 3995 2008-07-22 10:12 ip2c-names
-rw-r--r-- 1 grant wheel 2304896 2008-07-22 10:13 ip2c-data
-rw-r--r-- 1 grant wheel 612749 2008-07-22 15:23 ip2c-database.tar.bz2
-rw-r--r-- 1 grant wheel 293557 2008-07-22 15:35 ip2c-database.tar.lzma
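
To put rough numbers on that listing, the sketch below works out each
tarball's size as a percentage of the uncompressed input. The byte counts
are copied from the ls output above; the helper name pct is made up for
illustration, and tar's own header overhead is ignored.

```shell
#!/bin/sh
# Sizes in bytes, copied from the ls -l listing above.
raw=$((3995 + 2304896))   # ip2c-names + ip2c-data, ignoring tar overhead
bz2=612749
lzma=293557

# pct A B: print A as a percentage of B to one decimal place.
# (hypothetical helper, not part of any of the tools discussed here)
pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / b }'
}

echo "bzip2: $(pct "$bz2" "$raw")% of the original"
echo "lzma:  $(pct "$lzma" "$raw")% of the original"
echo "lzma is $(pct "$lzma" "$bz2")% the size of the bzip2 tarball"
```

With these sizes the three figures come out at about 26.5%, 12.7% and 47.9%.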

Lzma compression comes from the Windows world's 7-Zip archiver. 7-Zip publishes
an SDK under the LGPL. I downloaded the GPL'd unix source from:

http://tukaani.org/lzma/
http://tukaani.org/lzma/lzma-4.32.6.tar.gz

And had no problems compiling / installing the lzma utilities. Next was
to add lzma to tar. There are patches in the source tarball but they don't
match the tar versions included with slack-11.0 or slack-12.1. Another
wrinkle is that slackware has two versions of tar installed, one for
pkgtools and the other for userspace (slackware-12.1):

grant@pooh:~$ ls -l /bin/tar*
-rwxr-xr-x 1 root root 233196 2006-12-14 16:37 /bin/tar*
-rwxr-xr-x 1 root root 115036 2006-12-14 16:37 /bin/tar-1.13*
lrwxrwxrwx 1 root root 3 2008-05-26 13:45 /bin/tar-1.16.1 -> tar*

Before patching tar myself, I checked for the latest version and found the
latest tar-1.20 does support lzma, but not with a single letter option (-a)
that the lzma utilities author used.

I ran the usual ./configure; make; su; make install and let tar-1.20 install
under /usr/local so it doesn't interfere with the slack tar. Because
/usr/local/bin comes earlier on the $PATH, tar-1.20 is the one found first.
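
A quick way to confirm which tar wins is to list every tar on the search
path in order. This is only a sketch: which_all is a hypothetical helper
name, and it assumes a POSIX sh.

```shell
#!/bin/sh
# which_all NAME: list every executable called NAME on $PATH, in search order.
# (hypothetical helper; the first line printed is the one the shell will run)
which_all() {
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        [ -n "$dir" ] || dir=.          # an empty PATH entry means the current directory
        [ -x "$dir/$1" ] && [ ! -d "$dir/$1" ] && echo "$dir/$1"
    done
    IFS=$old_ifs
    return 0
}

which_all tar
tar --version | head -n 1
```

If the new install is picked up, the first path listed should be the one
under /usr/local and the version line should mention 1.20.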

See: http://www.gnu.org/software/tar/ for the latest tar source.

The new tar -a (--auto-compress) option picks the compression program from
the archive filename's suffix:

grant@deltree:~/ip2c$ time tar cvaf ip2c-database.tar.bz2 ip2c-data ip2c-names
ip2c-data
ip2c-names

real 0m4.452s
user 0m4.230s
sys 0m0.130s
grant@deltree:~/ip2c$ time tar cvaf ip2c-database.tar.lzma ip2c-data ip2c-names
ip2c-data
ip2c-names

real 0m16.886s
user 0m16.549s
sys 0m0.253s

So you can see lzma takes much longer to compress the same files (about
16.9s against 4.5s for bzip2), but decompression is much faster. These
times are on a 500MHz Celeron.

grant@deltree:~/ip2c/xxx$ time bzcat ../ip2c-database.tar.bz2 |tar xv
ip2c-data
ip2c-names

real 0m1.306s
user 0m1.150s
sys 0m0.153s
grant@deltree:~/ip2c/xxx$ time lzcat ../ip2c-database.tar.lzma |tar xv
ip2c-data
ip2c-names

real 0m0.484s
user 0m0.347s
sys 0m0.140s

Unfortunately tar has no single-letter option for lzma decompression, like
tar xvjf for bzip2, and 'tar xvf ../ip2c-database.tar.lzma --lzma'
looks clumsier to me than the lzcat pipeline above.
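
One possible workaround, sketched here rather than taken from tar or the
lzma utilities, is a small shell function that picks the decompressor from
the filename suffix and pipes it into tar, in the same way as the lzcat and
bzcat commands above. The name untar is hypothetical, and the .tar.gz
branch is an extra added for completeness.

```shell
#!/bin/sh
# untar FILE: choose a decompressor from FILE's suffix and pipe it into tar.
# (hypothetical helper, not part of tar or the lzma utilities)
untar() {
    case "$1" in
        *.tar.lzma) lzcat "$1" | tar xvf - ;;      # lzcat comes with the lzma utilities
        *.tar.bz2)  bzcat "$1" | tar xvf - ;;
        *.tar.gz)   gzip -dc "$1" | tar xvf - ;;
        *.tar)      tar xvf "$1" ;;
        *)          echo "untar: unrecognised suffix: $1" >&2
                    return 1 ;;
    esac
}
```

Usage would then be 'untar ../ip2c-database.tar.lzma', whatever the
compression. The function returns tar's exit status, so a failure in the
decompressor itself may go unnoticed.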

The large datafile I'm compressing is very repetitive, with about 92k
records like:

117440512 134217727 US
134217728 150994943 US
150994944 167772159 US
167772160 184549375 ZZ
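
Part of that repetitiveness shows up in the sample itself: each record
starts exactly one past the end of the previous one. The sketch below
counts how often that happens. It assumes the fields are "start end code"
as in the sample, and count_contiguous is a made-up name for illustration.

```shell
#!/bin/sh
# count_contiguous: read "start end code" records on stdin and report how
# many of them begin exactly one past the previous record's end.
# (hypothetical helper; the field layout is assumed from the sample records)
count_contiguous() {
    awk 'NR > 1 && $1 == prev + 1 { n++ }
         { prev = $2 }
         END { printf "%d of %d records follow on from the previous one\n", n, NR }'
}

count_contiguous <<'EOF'
117440512 134217727 US
134217728 150994943 US
150994944 167772159 US
167772160 184549375 ZZ
EOF
```

On the four sample records this reports 3 of 4. Running it over the whole
datafile ('count_contiguous < ip2c-data') would show how far the pattern
holds.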

The lzma web page claims: "Average compression ratio of LZMA is about
30% better than that of gzip, and 15% better than that of bzip2."

Grant.
--
http://bugsplatter.mine.nu/