Compress/decompress using prefix-free codes (Huffman)

14 views

Skip to first unread message

Dilawar Singh

unread,

Sep 26, 2017, 9:04:58 AM9/26/17

to Web and Coding Club IIT Bombay

Attached is a python script. It can be used to compress/decompress a text file. On large files (more than 5MB), one can reduce the size of the file by a factor of ~ 0.35 which is not far from the best compression ratio. Since it is in pure Python, it is slow. This script requires module huffman and bitarray.

pip install huffman bitarray --user

The script accepts two options -c or compression and -d for decompression. Following is help message produced by the script.


./compress -h
usage: compress [-h] [--compress] [--decompress] file

Compress/Decompress a file using Huffman codes.

positional arguments:
 file File to compress/decompress.

optional arguments:
 -h, --help show this help message and exit
 --compress, -c Compress file

This script is intended for Information Theory students who are looking for a python script to play with.

We read the input file and compute the frequency of each character (Counter from standard python library does the job). We pass this frequency information to huffman module and ask it to give a code-book (Huffman code). Next, we replace each character with code and save it to a binary file. We also add the code-book to the file as prefix. Adding code-book is necessary if we want to decompress it. The Huffman code gives the shorted binary code to the symbol which occurs most frequently, and longest to the one which occur most rarely. I ran the script (with time command) on a large file (76MB).


time ./compress ~/Downloads/camk12_pp54_voxel2_diff--PP1--1e-13-_suOFF.dat -c
Reading /home1/dilawars/Downloads/camk12_pp54_voxel2_diff--PP1--1e-13-_suOFF.dat (76 MB).. done.
Generating codebook.. done.
Compressing files .. done.
Average codeword length : 2.924465
| Optimal average code length: 2.869759
Compressed files is written to /home1/dilawars/Downloads/camk12_pp54_voxel2_diff--PP1--1e-13-_suOFF.dat.dx
| Original file size : 80169086
| Compressed file size : 29307774
| Compression ratio : 2.735421

real 0m40.917s
user 0m40.822s
sys 0m0.096s

gzip takes less than a second to compress it (0.75 sec) and this script took more than 40 secs.

The point of this script is demonstration, and not the efficiency. gzip reduced the file to size 2878272 bytes. Ours (29307774 bytes) is not bad. Note that we are storing our codebook in plain text and not in binary format.

We also print the Optimal average code length which is information theoretic optical average codeword length; no one can do better than this. And Huffman codes are quite close to this number.

On small file, you won’t see much reduction.

To decompress, we replace the code-word with the symbol (from the embedded codebook) and write to the output file. The compressed file can be open in text-editor to see the codebook.

File can be found on github.

compress

Reply all

Reply to author

Forward

0 new messages