I'm wondering what sort of effort would be required to make Scoobi support outputting a bzipped file.
I.e. instead of
persist(TextOutput.toTextFile(nicely_formatted, Opts.output))
I'd like to be able to say
persist(TextOutput.toBZipFile(nicely_formatted, Opts.output))
or something of the sort, so that the file automatically gets compressed on output.
The reason is that I'm working with really large data sets (say around 500 GB compressed, which would expand to maybe 4.2 TB uncompressed, since the compression ratio for this data is about 8.4). I want to be able to do transformations on the whole data set, but I don't necessarily have the disk space to store the uncompressed output.
How hard would it be to add a simple version of this support (i.e. just supporting compressed text files), and could someone outline the steps involved (more or less), since I'm not too familiar with Scoobi internals?
It seems that properly supporting this would require some concept of output-file transformations, so that I might say
persist(TextOutput.toTextFile(nicely_formatted, BZipOutput.toBZipFile(Opts.output)))
so that I can potentially create bzipped or gzipped files in any format, not just text. (Seems like this might be a good idea anyway, for increased modularity.)
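To make the modularity idea concrete, here is a rough sketch in plain Scala with no Scoobi types at all. Every name in it (OutputFormat, OutputTransform, TextFormat, GzipTransform, persistTo) is made up for illustration and is not an existing Scoobi API. It uses gzip rather than bzip2 only because the JDK ships a gzip stream and no bzip2 one; on Hadoop a codec class would presumably fill that role.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, OutputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical: a "format" only knows how to write records to some stream.
trait OutputFormat[A] {
  def write(records: Iterable[A], out: OutputStream): Unit
}

// Hypothetical: a "transformation" only knows how to wrap a stream.
trait OutputTransform {
  def wrap(out: OutputStream): OutputStream
}

// One line of text per record.
object TextFormat extends OutputFormat[String] {
  def write(records: Iterable[String], out: OutputStream): Unit =
    records.foreach(r => out.write((r + "\n").getBytes(UTF_8)))
}

// Compresses whatever bytes the format produces.
object GzipTransform extends OutputTransform {
  def wrap(out: OutputStream): OutputStream = new GZIPOutputStream(out)
}

object Sketch {
  // Any format can be combined with any transformation.
  def persistTo[A](records: Iterable[A],
                   format: OutputFormat[A],
                   transform: OutputTransform,
                   sink: OutputStream): Unit = {
    val wrapped = transform.wrap(sink)
    try format.write(records, wrapped)
    finally wrapped.close() // closing also flushes the gzip trailer
  }

  // Helper for checking the result: decompress gzip bytes back to text.
  def gunzip(bytes: Array[Byte]): String = {
    val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
    try scala.io.Source.fromInputStream(in, "UTF-8").mkString
    finally in.close()
  }
}
```

With something like this, Sketch.persistTo(lines, TextFormat, GzipTransform, sink) writes compressed text, and swapping in a different format or a different transform doesn't touch the other half.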
ben