Pbzip

Berenguer Miramontes

Aug 5, 2024, 2:45:50 AM

to orgacinam

Assume a fast, modern multi-core machine and a good Internet connection. On this setup, we install a package from the AUR using the PKGBUILD system (yaourt, pacaur, makepkg or whatever).

-> Thanks to the fast Internet connection the download is speedy

-> Thanks to the -j8 (or -j16, ...) option in the global makepkg configuration, compilation is parallelized and thus about as fast as it can be

-> However, one bottleneck remains: the final step compresses the package before it is handed to pacman for installation. This step can take a very long time (for example, if the package is 1 GB in size) and is inefficient because it uses only one core.

In other applications, we can use pbzip(2) to compress in parallel and the performance is awesome (when compressing dd output from mechanical HDDs, it even keeps up in real time). So my question is: is it possible to use pbzip or pbzip2 to compress a package for pacman, and if so, how is it done?

If you're just building the packages for personal use rather than distribution, then you can always skip compression completely. You just end up with a larger package, and you're likely to be decompressing it straight away when installing anyway.

You can set the compression command to whatever you want, but really the smart thing would be to do as slithery said and disable compression altogether. Compression is primarily useful for reducing download size and time, which quite obviously assumes you are hosting a repository for public consumption.

I was under the impression that reproducible builds were all about the binaries. Which would presumably tell you more about the actual differences anyway, rather than looking at an additional layer of indirection via xz-compressed binaries.

bzip2 has gone through multiple maintainers since its initial release, with Micah Snyder being the maintainer since June 2021. There have been some modifications to the algorithm, such as pbzip2, which uses multi-threading to improve compression speed on multi-CPU and multi-core computers.

Seward made the first public release of bzip2, version 0.15, in July 1996. The compressor's stability and popularity grew over the next several years, and Seward released version 1.0 in late 2000. Following a nine-year hiatus of updates for the project since 2010, on 4 June 2019 Federico Mena accepted maintainership of the bzip2 project.[4] Since June 2021, the maintainer is Micah Snyder.[5]

Any sequence of 4 to 255 consecutive duplicate symbols is replaced by the first 4 symbols and a repeat length between 0 and 251. Thus the sequence AAAAAAABBBBCCCD is replaced with AAAA\3BBBB\0CCCD, where \3 and \0 represent byte values 3 and 0 respectively. Runs of symbols are always transformed after 4 consecutive symbols, even if the run-length is set to zero, to keep the transformation reversible.
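The initial run-length step described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the paragraph, not the bzip2 source; the function names `rle1_encode` and `rle1_decode` are made up for this example.

```python
def rle1_encode(data: bytes) -> bytes:
    """Replace runs of 4..255 equal bytes by 4 copies plus a count byte (0..251)."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        sym = data[i]
        run = 1
        # Measure the run, capped at 255 so the count byte stays within 0..251.
        while i + run < n and data[i + run] == sym and run < 255:
            run += 1
        if run >= 4:
            out += bytes([sym]) * 4
            out.append(run - 4)  # written even when it is 0, to stay reversible
        else:
            out += bytes([sym]) * run
        i += run
    return bytes(out)


def rle1_decode(data: bytes) -> bytes:
    """Inverse of rle1_encode: after 4 equal bytes, the next byte is a repeat count."""
    out = bytearray()
    prev, run = None, 0
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        i += 1
        if b == prev:
            run += 1
        else:
            prev, run = b, 1
        out.append(b)
        if run == 4:
            if i >= n:
                raise ValueError("truncated input: missing repeat count")
            out += bytes([b]) * data[i]
            i += 1
            prev, run = None, 0  # the next byte starts a fresh run
    return bytes(out)
```

Calling `rle1_encode(b"AAAAAAABBBBCCCD")` gives `b"AAAA\x03BBBB\x00CCCD"`, which is the example from the paragraph above.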

The move-to-front transform again does not alter the size of the processed block. Each of the symbols in use in the document is placed in an array. When a symbol is processed, it is replaced by its location (index) in the array and that symbol is shuffled to the front of the array. The effect is that immediately recurring symbols are replaced by zero symbols (long runs of any arbitrary symbol thus become runs of zero symbols), while other symbols are remapped according to their local frequency.
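A minimal Python sketch of the move-to-front transform may help here. It assumes the array starts out holding the symbols in use in sorted order; the paragraph does not state the initial order, so treat that as an assumption. The names `mtf_encode` and `mtf_decode` are hypothetical.

```python
def mtf_encode(data: bytes):
    """Return (indices, alphabet): each byte is replaced by its current list position."""
    alphabet = sorted(set(data))  # assumed initial order: symbols in use, sorted
    table = list(alphabet)
    indices = []
    for b in data:
        pos = table.index(b)
        indices.append(pos)
        table.insert(0, table.pop(pos))  # shuffle the symbol to the front
    return indices, alphabet


def mtf_decode(indices, alphabet) -> bytes:
    """Inverse transform: replay the same list updates from the indices."""
    table = list(alphabet)
    out = bytearray()
    for pos in indices:
        b = table.pop(pos)
        out.append(b)
        table.insert(0, b)
    return bytes(out)
```

For the input `b"bbbaaac"` the indices come out as `[1, 0, 0, 1, 0, 0, 2]`: every immediately repeated symbol turns into a zero, as the paragraph says, and the output has the same length as the input.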

Much "natural" data contains identical symbols that recur within a limited range (text is a good example). As the MTF transform assigns low values to symbols that reappear frequently, this results in a data stream containing many symbols in the low integer range, many of them being identical (different recurring input symbols can actually map to the same output symbol). Such data can be very efficiently encoded by any legacy compression method.

Long strings of zeros in the output of the move-to-front transform (which come from repeated symbols in the output of the BWT) are replaced by a sequence of two special codes, RUNA and RUNB, which represent the run-length as a binary number. Actual zeros are never encoded in the output; a lone zero becomes RUNA. (This step in fact is done at the same time as MTF is; whenever MTF would produce zero, it instead increases a counter to then encode with RUNA and RUNB.)

The sequence 0, 0, 0, 0, 0, 1 would be represented as RUNA, RUNB, 1; RUNA, RUNB represents the value 5 as described below. The run-length code is terminated by reaching another normal symbol. This RLE process is more flexible than the initial RLE step, as it is able to encode arbitrarily long integers (in practice, this is usually limited by the block size, so that this step does not encode a run of more than 900000 bytes). The run-length is encoded in this fashion: assign place values of 1 to the first symbol, 2 to the second, 4 to the third, and so on; multiply each place value in a RUNB spot by 2; and add all the resulting place values (for RUNA and RUNB values alike) together. This is similar to base-2 bijective numeration. Thus, the sequence RUNA, RUNB results in the value (1 + 2 × 2) = 5. As a more complicated example, the sequence RUNA, RUNB, RUNA, RUNA, RUNB has place values 1, 2, 4, 8, 16 and results in 1 + 2 × 2 + 4 + 8 + 2 × 16 = 49.
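The RUNA/RUNB rule is easier to follow with a small Python sketch. It treats RUNA as the digit 1 and RUNB as the digit 2 in bijective base 2, least significant symbol first, which is my reading of the description above. The function names and the string representation of the two codes are illustrative only.

```python
RUNA, RUNB = "RUNA", "RUNB"


def encode_run(length: int) -> list:
    """Encode a run of `length` zeros (length >= 1) as a list of RUNA/RUNB codes."""
    if length < 1:
        raise ValueError("a run has at least one zero")
    codes = []
    while length > 0:
        if length % 2 == 1:
            codes.append(RUNA)          # digit 1 at this place value
            length = (length - 1) // 2
        else:
            codes.append(RUNB)          # digit 2 at this place value
            length = (length - 2) // 2
    return codes


def decode_run(codes) -> int:
    """Sum place values 1, 2, 4, ...; a RUNB counts its place value twice."""
    total, place = 0, 1
    for code in codes:
        total += place * (2 if code == RUNB else 1)
        place *= 2
    return total
```

With this sketch, `encode_run(5)` returns `["RUNA", "RUNB"]` and `encode_run(1)` returns `["RUNA"]`, matching the statements that five zeros become RUNA, RUNB and that a lone zero becomes RUNA.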

Several identically sized Huffman tables can be used with a block if the gain from using them is greater than the cost of including the extra table. At least 2 and up to 6 tables can be present, with the most appropriate table being reselected before every 50 symbols processed. This has the advantage of having very responsive Huffman dynamics without having to continuously supply new tables, as would be required in DEFLATE. Run-length encoding in the previous step is designed to take care of codes that have an inverse probability of use higher than the shortest Huffman code in use.

If multiple Huffman tables are in use, the selection of each table (numbered 0 to 5) is done from a list by a zero-terminated bit run between 1 and 6 bits in length. The selection is into a MTF list of the tables. Using this feature results in a maximal expansion of around 1.015, but generally less. This expansion is likely to be greatly over-shadowed by the advantage of selecting more appropriate Huffman tables, and the common-case of continuing to use the same Huffman table is represented as a single bit. Rather than unary encoding, effectively this is an extreme form of a Huffman tree, where each code has half the probability of the previous code.
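The selector coding can be sketched as follows, assuming that a table at position p of the MTF list is written as p one-bits followed by a terminating zero bit (so position 0 costs a single bit and position 5 costs six). The bit strings and function names are for illustration only; a real encoder packs these bits into the block.

```python
def encode_selectors(selectors, n_tables: int) -> str:
    """MTF the table numbers, then write each position as '1' * pos + '0'."""
    mtf = list(range(n_tables))
    bits = []
    for table in selectors:
        pos = mtf.index(table)
        bits.append("1" * pos + "0")
        mtf.insert(0, mtf.pop(pos))  # the table just used moves to the front
    return "".join(bits)


def decode_selectors(bits: str, n_tables: int) -> list:
    """Inverse: count one-bits up to each zero, then undo the MTF step."""
    mtf = list(range(n_tables))
    out, pos = [], 0
    for bit in bits:
        if bit == "1":
            pos += 1
        else:
            table = mtf.pop(pos)
            mtf.insert(0, table)
            out.append(table)
            pos = 0
    return out
```

For example, `encode_selectors([3, 3, 3, 3], 6)` yields `"1110000"`: four bits to pick table 3 the first time, then one bit for each group that keeps using it, which is the common case the paragraph mentions.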

As an overview, a .bz2 stream consists of a 4-byte header, followed by zero or more compressed blocks, immediately followed by an end-of-stream marker containing a 32-bit CRC for the plaintext whole stream processed. The compressed blocks are bit-aligned and no padding occurs.
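As a quick check of the 4-byte header, the sketch below uses Python's standard `bz2` module to produce a stream and then reads the header fields. The field layout used here (the magic `BZ`, the letter `h`, then a digit `1` to `9` giving the block size in units of 100 kB) is not spelled out in the text above and comes from my understanding of the format. `parse_bz2_header` is a hypothetical helper.

```python
import bz2


def parse_bz2_header(stream: bytes) -> dict:
    """Split the 4-byte stream header into its fields and validate them."""
    if len(stream) < 4:
        raise ValueError("stream is shorter than the 4-byte header")
    magic, version, level = stream[:2], stream[2:3], stream[3:4]
    if magic != b"BZ" or version != b"h" or not (b"1" <= level <= b"9"):
        raise ValueError("not a bzip2 stream header")
    return {
        "magic": magic.decode("ascii"),
        "version": version.decode("ascii"),
        # The digit selects the block size in multiples of 100 kB.
        "block_size_bytes": int(level.decode("ascii")) * 100_000,
    }


if __name__ == "__main__":
    packed = bz2.compress(b"hello, pacman", compresslevel=9)
    print(parse_bz2_header(packed))
```

The compression level passed to `bz2.compress` is what ends up in the fourth header byte, so level 9 corresponds to the 900 kB blocks discussed in this section.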

Because of the first-stage RLE compression (see above), the maximum length of plaintext that a single 900 kB bzip2 block can contain is around 46 MB (45,899,236 bytes). This can occur if the whole plaintext consists entirely of repeated values (the resulting .bz2 file in this case is 46 bytes long). An even smaller file of 40 bytes can be achieved by using an input containing entirely values of 251, an apparent compression ratio of 1147480.9:1.

A compressed block in bzip2 can be decompressed without having to process earlier blocks. This means that bzip2 files can be decompressed in parallel, making it a good format for use in big data applications with cluster computing frameworks like Hadoop and Apache Spark.[8]

bzip2 compresses most files more effectively than the older LZW (.Z) and Deflate (.zip and .gz) compression algorithms, but is considerably slower. LZMA is generally more space-efficient than bzip2 at the expense of even slower compression speed, while having much faster decompression.[9]
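To see this trade-off on your own data, a rough comparison can be run with the codecs that ship in the Python standard library. This is a sketch rather than a careful benchmark: it covers only Deflate (`zlib`), bzip2 (`bz2`) and LZMA (`lzma`) at their default settings, and timings will vary with the input and the machine. The function name `compare_codecs` is made up for this example.

```python
import bz2
import lzma
import time
import zlib


def compare_codecs(data: bytes) -> dict:
    """Compress and decompress `data` with each codec; report sizes and timings."""
    codecs = {
        "deflate": (zlib.compress, zlib.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        unpacked = decompress(packed)
        t2 = time.perf_counter()
        if unpacked != data:
            raise RuntimeError(f"{name} round trip failed")
        results[name] = {
            "compressed_bytes": len(packed),
            "compress_seconds": t1 - t0,
            "decompress_seconds": t2 - t1,
        }
    return results


if __name__ == "__main__":
    sample = b"The quick brown fox jumps over the lazy dog.\n" * 20_000
    for name, row in compare_codecs(sample).items():
        print(name, row)
```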

bzip2 performance is asymmetric, as decompression is relatively fast. Motivated by the long time required for compression, a modified version was created in 2003 called pbzip2 that used multi-threading to encode the file in multiple chunks, giving almost linear speedup on multi-CPU and multi-core computers.[12] As of May 2010, this functionality has not been incorporated into the main project.
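The chunking idea can be imitated in Python with the standard `bz2` module: split the input into pieces, compress each piece as its own bzip2 stream on a worker thread, and concatenate the results. This is a sketch of the general approach, not pbzip2's actual code; the chunk size, worker count and function name are assumptions, and the speedup relies on CPython releasing the GIL while `bz2.compress` runs, which I believe it does.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_bz2_compress(data: bytes, chunk_size: int = 900_000, workers: int = 4) -> bytes:
    """Compress fixed-size chunks independently and join the resulting streams."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        chunks = [b""]  # still emit one valid (empty) stream
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the streams line up with the chunks.
        return b"".join(pool.map(bz2.compress, chunks))


if __name__ == "__main__":
    payload = bytes(range(256)) * 20_000
    packed = parallel_bz2_compress(payload)
    # bz2.decompress accepts several concatenated streams (Python 3.3 and later).
    assert bz2.decompress(packed) == payload
    print(len(payload), "->", len(packed))
```

Because each chunk is a complete stream, the output can also be read by the ordinary single-threaded decompressor, at the cost of a slightly larger file than one continuous stream would give.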

Like gzip, bzip2 is only a data compressor. It is not an archiver like tar or ZIP; the bzip2 file format does not support storing the contents of multiple files in a single compressed file, and the program itself has no facilities for multiple files, encryption or archive-splitting. In the UNIX tradition, archiving could be done by a separate program producing an archive which is then compressed with bzip2, and un-archiving could be done by bzip2 uncompressing the compressed archive file and a separate program unpacking it. Some archivers have built-in support for compression and decompression, so that it is not necessary to use the bzip2 program to compress or decompress the archive. GnuPG also has built-in support for bzip2 compression and decompression.
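Python's standard `tarfile` module is one example of an archiver with built-in bzip2 support: the archive layer handles multiple files and the `w:bz2` mode applies the compression. The sketch below builds a small `.tar.bz2` in memory; the helper names and file names are invented for the example.

```python
import io
import tarfile


def make_tar_bz2(files: dict) -> bytes:
    """Pack {name: bytes} into a bzip2-compressed tar archive held in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_tar_bz2(blob: bytes) -> dict:
    """Unpack the archive again and return {name: bytes}."""
    result = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:bz2") as tar:
        for member in tar.getmembers():
            extracted = tar.extractfile(member)
            if extracted is not None:  # skip directories and special entries
                result[member.name] = extracted.read()
    return result


if __name__ == "__main__":
    blob = make_tar_bz2({"PKGBUILD": b"pkgname=demo\n", "README": b"hello\n"})
    print(sorted(read_tar_bz2(blob)))
```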

Bzip2 and gzip only use one core, although many computers have more than one core. But there are programs like lbzip2, pbzip2 and pigz, which use all available cores and promise to be compatible with bzip2 and gzip.

pigz-2.1.6, which is included in Precise Pangolin, refuses to decompress files with unknown suffixes (e.g. initramfs-*.img). This is fixed in pigz-2.2.4, which ships with Quantal. So you might want to wait until Quantal, install the Quantal package manually, or hold off on linking gunzip/gzcat/zcat for now.

The symlink answer is really incorrect. It would replace the default gzip (or bzip2) with pigz (or pbzip2) for the entire system. While the parallel implementations are remarkably similar to the single-process versions, subtle differences in command-line options could break core system processes that depend on the original behaviour.
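One way to avoid system-wide symlinks is to pick the parallel tool per invocation and fall back to the stock one when it is missing. The Python sketch below only looks up which program is installed; the preference order and the function name are assumptions, and the `which` parameter exists purely so the lookup can be swapped out in tests.

```python
import shutil


def pick_compressor(preferred=("pbzip2", "lbzip2", "bzip2"), which=shutil.which):
    """Return the path of the first installed tool in `preferred`, or None."""
    for name in preferred:
        path = which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    tool = pick_compressor()
    print("would compress with:", tool or "no bzip2-compatible tool found")
```

A script that calls `pick_compressor()` gets the speedup where a parallel tool exists, while everything else on the system keeps using the default gzip and bzip2 binaries.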

There are many compression algorithms around, and bzip2 is one of the slower ones. Plain gzip tends to be significantly faster, at usually not much worse compression. When speed is the most important, lzop is my favourite. Poor compression, but oh so fast.