Reading .tar(.gz) archives in Julia

Tomas Lycken

Jun 10, 2014, 3:37:10 AM
to julia...@googlegroups.com
For my thesis, I'm working on a project that produces huge amounts of output in text files: about 25-30 GB spread across a million or more files per simulation run. If I compress the files using e.g. `tar --xz --create -f archive.tar.xz tracefiles/`, I can reduce the size on disk by a factor of 5-6 or even more. I postprocess all this data in Julia, and reading the data files seems to be a major bottleneck.

Has any effort been made toward reading files in these formats in Julia? I've seen [ZipFile](https://github.com/fhs/ZipFile.jl) for handling the .zip format, but unfortunately `zip` isn't available on our cluster, while `tar` is.

If there hasn't been any work on this, I might take a stab at it sometime - but first I must finish my thesis...

// T

Tobias Knopp

Jun 10, 2014, 4:16:41 AM
to julia...@googlegroups.com
I have been using ZipFile.jl and it has worked quite well. I am not sure what you mean by "zip" not being available: ZipFile relies on the zlib library and works "in memory" using ccalls.

One thing I am not sure about is whether ZipFile supports zip64 files. There is a TODO here: https://github.com/fhs/ZipFile.jl/blob/master/src/ZipFile.jl#L54. I don't know how much work it would be to implement if it is missing.

Simon Byrne

Jun 10, 2014, 4:17:13 AM
to julia...@googlegroups.com
If you were happy to just compress/uncompress each text file individually (i.e. not using tar), you could use GZip.jl.
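As a rough illustration of the per-file approach, here is a minimal sketch in Python (standard library only, so not the Julia package mentioned above). It assumes each trace file holds whitespace-separated numbers; the file name `trace_0001.txt.gz` and the helper names are hypothetical.

```python
import gzip
import os
import tempfile


def read_gz_rows(path):
    """Parse a gzip-compressed text file of whitespace-separated numbers."""
    rows = []
    # "rt" decompresses on the fly and yields decoded text lines.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                rows.append([float(field) for field in line.split()])
    return rows


def write_demo_file(directory):
    """Write a small hypothetical trace file, compressed on its own."""
    path = os.path.join(directory, "trace_0001.txt.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write("0.0 1.5\n0.1 1.7\n")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        print(read_gz_rows(write_demo_file(directory)))
```

Because every file is compressed independently, any single file can be opened directly, at the cost of keeping a million small files on disk.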

libtar provides a C interface for some tar functionality; however, it's worth keeping in mind that tar files are not indexed, and so have very poor random access performance (extracting an individual file requires sequentially searching through the whole tarball). If you want to use tar, you might be better off just untar-ing the whole lot before running your Julia job, then tar-ing it all back up again after, in which case you might as well just use the shell commands.
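To make the sequential-access point concrete, here is a minimal sketch in Python (standard library `tarfile`, not libtar and not Julia) that visits every member of a compressed tarball in a single forward pass, without unpacking anything to disk. The helper names and the two-file demo archive are made up for illustration. Stream mode `r|*` detects the compression automatically and never seeks backwards, which is how an unindexed tar has to be read anyway.

```python
import io
import tarfile


def iter_tar_text(fileobj):
    """Yield (member name, decoded text) for each regular file, in archive order."""
    # "r|*" = forward-only stream with transparent compression detection.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, etc.
            # In stream mode the data must be read before moving to the next member.
            data = archive.extractfile(member).read()
            yield member.name, data.decode("utf-8")


def build_demo_archive():
    """Build a small in-memory .tar.xz holding two hypothetical trace files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:xz") as archive:
        for name, text in [("tracefiles/a.txt", "1 2 3\n"),
                           ("tracefiles/b.txt", "4 5 6\n")]:
            payload = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, text in iter_tar_text(build_demo_archive()):
        print(name, text.split())
```

This works well when the postprocessing can consume files in archive order. Fetching one particular file would still mean decompressing and scanning everything stored before it.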

Kevin Squire

Jun 10, 2014, 5:24:41 AM
to julia...@googlegroups.com
Wrapping libxz would actually be quite useful, though (libbz2 as well).  

You might look at GZip.jl or Libz.jl for inspiration--it's quite possible that the library is similarly set up. Clang.jl also might be useful to produce an initial wrapper. 

Cheers, Kevin

gael....@gmail.com

Jun 10, 2014, 6:45:45 AM
to julia...@googlegroups.com
What would *also* be pretty useful is allowing different compression filters in HDF5.jl.

HDF5 compression capabilities are not limited to Deflate. Blosc, for one, has allowed fast and efficient compression/decompression of data in my case.

Your program would have to be changed to save data in an HDF5 file, but you'll get ASCII -> binary "compression" for free, a file format pretty good for random access and compression filters that allow even further size reduction.
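The ASCII -> binary saving can be estimated without HDF5 at all. The sketch below, in Python with only the standard library, compares the bytes needed to store the same numbers as decimal text and as raw 64-bit floats. It is a stand-in for illustration, not the HDF5 API, and the sample values are invented.

```python
from array import array


def text_size(values):
    """Bytes needed to store the values as one decimal number per line."""
    return sum(len(repr(v)) + 1 for v in values)  # +1 for the newline


def binary_size(values):
    """Bytes needed to store the same values as raw 64-bit floats."""
    return len(array("d", values).tobytes())


if __name__ == "__main__":
    samples = [i / 7 for i in range(1000)]  # hypothetical simulation output
    print("text bytes:  ", text_size(samples))
    print("binary bytes:", binary_size(samples))

    # The binary form round-trips exactly, with no parsing step.
    restored = array("d")
    restored.frombytes(array("d", samples).tobytes())
    assert list(restored) == samples
```

Full-precision doubles written as text typically need well over 8 characters each, so the binary form is smaller before any compression filter is applied, and it skips the text-parsing cost on read.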

Tomas Lycken

Jun 11, 2014, 7:21:42 AM
to julia...@googlegroups.com
That would indeed be useful, but not to me at the moment - my data comes from a C++ application...

// T

gael....@gmail.com

Jun 11, 2014, 7:42:32 AM
to julia...@googlegroups.com
Just so you know, there are HDF5 bindings in almost every language I know, C++ included.

Don't take what I said as a suggestion though: it's perfectly fine as you did it. But others might read this thread and say "HDF5? why not, indeed".
