I'd like to add support for gzipped inputs to 0.2.3 which you've been
working on. I was reading your patch at
http://github.com/erans/disco/commit/569f3e90f142a13584d093119caec5d983dfbfcc
and I noticed some issues:
1) is_file_gzip() is run for every input which is really costly, given
that most files are not compressed. I ran a small test case and it seems
that opening a file with is_file_gzip() is about 8x slower than the
normal case (37ms vs. 5ms for 1000 files).
2) decode_local_input() is called by open_local() which means that gzip
support works only for local files. This causes unexpected results if for
some reason the task fails at the node where its input file is located and
it is re-scheduled to another node. Now when it tries to access the input
from a remote node, it fails because decode_local_input() is not called.
Extending the current decode_local_input() to handle the remote case
requires some tweaking as you can't do fd.seek() with an HTTP connection
(in theory you could request only the last 4 bytes using the byte-range
HTTP header but it gets complicated).
Fortunately, as we discussed earlier
http://groups.google.com/group/disco-dev/browse_thread/thread/3f501daaec0475d7?hl=en
not being able to parse uncompressed file size correctly is not a
show-stopper. The decoder will fail anyways if the stream is corrupted.
In order to solve 1) and 2), I would propose (along the lines of our
earlier discussion) that let's move decode_local_input() and
is_file_gzip() to a separate map_reader function, say decode_compressed().
This nicely solves 1), since it's up to the user to use decompression if
needed. It also solves 2), since map_reader works both in the local and
the remote case. What it comes to uncompressed size, we can just ignore
the issue for now since it's not really required.
We can still keep the beauty of automagic detection of various file
formats. Compressed_reader() could be extended to support bzip2, tar etc.
so the user doesn't have to worry about zillions of different readers.
However, there's a slight problem with this approach: If you
specify map_reader = decode_compressed(), you cannot use your custom
map_reader anymore.
A traditional solution to a problem like this is to allow stackable
decoders. Maybe this case justifies extending the map_reader parameter so
that instead of a single map_reader, you can specify a list of them, i.e.
map_reader = [decode_compressed, my_data_parser]
Stackable readers would work simply so that decode_compressed() returns a
triple, (fd, sze, fname) which is passed to my_data_parser(). Here fd
would be the uncompressed input stream, sze = None (since we don't know
the uncompressed size) and fname is the original filename.
How does the user know which modules are required by decode_compressed,
that is, what to add to required_modules (obviously she knows requirements
of her own code, my_data_parser)? Well, it's pretty much impossible to
know.
I think it should be possible to infer required modules automatically by
running dis.dis(decode_compressed) disassembler and by checking
LOAD_GLOBAL lines.
Eran, what do you think about this plan? In practice, you'd need to move
is_file_gzip() and decode_local_input() to a new function (e.g.
decode_compressed) which should return (fd, None, fname).
I will add support for stackable map_readers. Any good ideas regarding the
inference of module dependencies are welcome. I'll take a look if
dis.dis() could do the job.
Ville
Btw, I'd still like to merge your hashlib patch, if you can separate it
from the gzip changes.
Thanks. I'll merge it soon. I'm hindered by the fact that python2.6 is not
available in Debian (testing) yet.
I was thinking of the class approach as well. However, if it turns out to
be possible to infer module dependencies automatically, so there's no need
to supply a separate function to list them, a plain function should do
the job plus it's much simpler.
Yep. Having to use import in map / reduce functions is somewhat kludgy.
The new map_init and reduce_init functions should solve that issue.
Great. I'll keep you updated.
> Once we finalize our approach here I will make the necessary adjustments to the gzip reader.