Single file extraction from .tar.bz file


Pablo Rozas Larraondo

Jan 5, 2016, 3:09:06 AM1/5/16
to golang-nuts
I have a working version of a Go program that extracts a specific file from a bzip2-compressed tar file. I'm using the standard library:

bzf := bzip2.NewReader(file)
tr := tar.NewReader(bzf) // avoid shadowing the tar package

// Iterate through the tar archive until the file is found or end of input.
for {
	// Expensive operation: each call decompresses and discards data.
	header, err := tr.Next()
	if err == io.EOF {
		break
	}

...

The problem is that for big files the iteration is quite slow, and it takes a long time when the file to extract is near the end of the tar.

I'm wondering if there are concurrent alternatives to this method that can make use of more than one core to accelerate the file extraction process.

Thanks for your help,
Pablo

thebroke...@gmail.com

Jan 5, 2016, 5:28:49 AM1/5/16
to golang-nuts
My guess is that you're limited by bzip2 decompression. Unfortunately, the standard library implementation is not particularly fast. I wrote a new bzip2 decoder that decompresses approximately 2x faster than the standard library (comparable to the C version).

For even faster decompression, there are ways to parallelize the decompression of the bzip2 format, but I am not aware of any Go package that currently does this (it is one I plan on writing eventually).

Aside from decompression, tar.Reader.Next reads (and discards) all unread data if the underlying io.Reader does not also satisfy io.Seeker. If the bzip2 file is one that you read often, it may make sense to index the locations of all bz2 blocks (usually spaced about 1MB apart). The bz2blocks package lets you build such an index of the bzip2 file and exposes it as an io.ReaderAt, and an io.SectionReader can convert an io.ReaderAt into an io.ReadSeeker. This allows tar.Reader.Next to skip large files much more efficiently.

JT

Manlio Perillo

Jan 5, 2016, 6:46:30 AM1/5/16
to golang-nuts
On Tuesday, January 5, 2016 at 9:09:06 AM UTC+1, Pablo Rozas-Larraondo wrote:
> I have a working version of a Go program that extracts a specific file inside a tar bzipped file. I'm using the standard library to
> [...]
>
> The problem is that for big files, the process of iterating is quite slow and it takes a long time in the case of extracting a file which is at the end of the tar.
>
> I'm wondering if there are concurrent alternatives to this method that can make use of more than one core to accelerate the file extraction process.


Each block of a bzip2 stream can be decoded independently.
The problem is that the tar file is compressed as a whole, so after jumping to an arbitrary block, finding the start of a tar header is not simple.

Regards,
Manlio

Pablo Rozas-Larraondo

Jan 5, 2016, 7:25:38 PM1/5/16
to golang-nuts, thebroke...@gmail.com
Thanks for your help. I think generating a location index for the bz2 blocks is quite a nice solution to the problem I'm trying to solve. I'm going to do some research on how to do this.
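As a starting point for that research, here is a minimal sketch (my own illustration, with a hypothetical function name) of the indexing step: bzip2 compressed blocks begin with the 48-bit magic 0x314159265359, but blocks are not byte-aligned, so the scan has to run bit by bit. The magic can also occur by chance inside compressed data, so a real index must validate each candidate offset by attempting to decode from it.

```go
package main

import "fmt"

// scanBlockMagic returns the bit offsets at which the 48-bit bzip2
// compressed-block magic (0x314159265359) occurs in data. It slides a
// 48-bit window one bit at a time over the input.
func scanBlockMagic(data []byte) []int64 {
	const magic = 0x314159265359
	const mask = 1<<48 - 1
	var offsets []int64
	var window uint64
	var bitPos int64
	for _, b := range data {
		for shift := 7; shift >= 0; shift-- {
			window = (window<<1 | uint64(b>>uint(shift)&1)) & mask
			// Window is full once 48 bits have been consumed.
			if bitPos >= 47 && window == magic {
				offsets = append(offsets, bitPos-47)
			}
			bitPos++
		}
	}
	return offsets
}

func main() {
	// Magic bytes placed at byte offset 2, i.e. bit offset 16.
	data := []byte{0x00, 0xff, 0x31, 0x41, 0x59, 0x26, 0x53, 0x59}
	fmt.Println(scanBlockMagic(data)) // [16]
}
```

Each recorded bit offset, paired with the decompressed offset it maps to, is one entry of the kind of index JT described.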

Cheers,
Pablo 