Gzip support

Ville Tuulos

unread,

Jul 28, 2009, 4:56:51 AM7/28/09

to eran.s...@gmail.com, disc...@googlegroups.com

Hi Eran,

I'd like to add support for gzipped inputs to 0.2.3 which you've been
working on. I was reading your patch at

http://github.com/erans/disco/commit/569f3e90f142a13584d093119caec5d983dfbfcc

and I noticed some issues:

1) is_file_gzip() is run for every input which is really costly, given
that most files are not compressed. I ran a small test case and it seems
that opening a file with is_file_gzip() is about 8x slower than the
normal case (37ms vs. 5ms for 1000 files).

2) decode_local_input() is called by open_local() which means that gzip
support works only for local files. This causes unexpected results if for
some reason the task fails at the node where its input file is located and
it is re-scheduled to another node. Now when it tries to access the input
from a remote node, it fails because decode_local_input() is not called.

Extending the current decode_local_input() to handle the remote case
requires some tweaking as you can't do fd.seek() with an HTTP connection
(in theory you could request only the last 4 bytes using the byte-range
HTTP header but it gets complicated).

Fortunately, as we discussed earlier

http://groups.google.com/group/disco-dev/browse_thread/thread/3f501daaec0475d7?hl=en

not being able to parse uncompressed file size correctly is not a
show-stopper. The decoder will fail anyways if the stream is corrupted.

In order to solve 1) and 2), I would propose (along the lines of our
earlier discussion) that let's move decode_local_input() and
is_file_gzip() to a separate map_reader function, say decode_compressed().

This nicely solves 1), since it's up to the user to use decompression if
needed. It also solves 2), since map_reader works both in the local and
the remote case. What it comes to uncompressed size, we can just ignore
the issue for now since it's not really required.

We can still keep the beauty of automagic detection of various file
formats. Compressed_reader() could be extended to support bzip2, tar etc.
so the user doesn't have to worry about zillions of different readers.

However, there's a slight problem with this approach: If you
specify map_reader = decode_compressed(), you cannot use your custom
map_reader anymore.

A traditional solution to a problem like this is to allow stackable
decoders. Maybe this case justifies extending the map_reader parameter so
that instead of a single map_reader, you can specify a list of them, i.e.

map_reader = [decode_compressed, my_data_parser]

Stackable readers would work simply so that decode_compressed() returns a
triple, (fd, sze, fname) which is passed to my_data_parser(). Here fd
would be the uncompressed input stream, sze = None (since we don't know
the uncompressed size) and fname is the original filename.

How does the user know which modules are required by decode_compressed,
that is, what to add to required_modules (obviously she knows requirements
of her own code, my_data_parser)? Well, it's pretty much impossible to
know.

I think it should be possible to infer required modules automatically by
running dis.dis(decode_compressed) disassembler and by checking
LOAD_GLOBAL lines.

Eran, what do you think about this plan? In practice, you'd need to move
is_file_gzip() and decode_local_input() to a new function (e.g.
decode_compressed) which should return (fd, None, fname).

I will add support for stackable map_readers. Any good ideas regarding the
inference of module dependencies are welcome. I'll take a look if
dis.dis() could do the job.

Ville

Btw, I'd still like to merge your hashlib patch, if you can separate it
from the gzip changes.

Eran Sandler

unread,

Jul 28, 2009, 6:08:42 AM7/28/09

to disc...@googlegroups.com

Hi Ville,

Thanks for the comments.

I've recreated and sent you a patch for the use of hashlib where available (as opposed to md5 which issues a warning on Python 2.6 - and I think even on 2.5).

Regarding chaining the readers, I agree, its a much more powerful thing.

The only useful solution I thought of regarding reader was to use a class. You'll have the reader function (possibly as a static method using the staticmethod decorator because you don't need the "self" parameter in the function) and a function to return a list of dependencies.

The code can support getting both a function as a reader or that special class (which should probably inherit from a base class so it will be easier to detect inside the code).

That way a reader can become a rather independent unit which contains its code and its dependencies. These dependencies will go into the required_modules.

The less useful solution would be to have the import statements inside the reader code, like one would do in the map and reduce functions if they don't use the required_modules parameter.

Comments?

Once we finalize our approach here I will make the necessary adjustments to the gzip reader.

Eran

Ville Tuulos

unread,

Aug 2, 2009, 10:44:24 PM8/2/09

to disc...@googlegroups.com

On Tue, 28 Jul 2009, Eran Sandler wrote:

> Hi Ville,
> Thanks for the comments.
>
> I've recreated and sent you a patch for the use of hashlib where available (as opposed to md5
> which issues a warning on Python 2.6 - and I think even on 2.5).

Thanks. I'll merge it soon. I'm hindered by the fact that python2.6 is not
available in Debian (testing) yet.

> Regarding chaining the readers, I agree, its a much more powerful thing.
>
> The only useful solution I thought of regarding reader was to use a class. You'll have the reader
> function (possibly as a static method using the staticmethod decorator because you don't need the
> "self" parameter in the function) and a function to return a list of dependencies.

I was thinking of the class approach as well. However, if it turns out to
be possible to infer module dependencies automatically, so there's no need
to supply a separate function to list them, a plain function should do
the job plus it's much simpler.

> The code can support getting both a function as a reader or that special class (which should
> probably inherit from a base class so it will be easier to detect inside the code).
>
> That way a reader can become a rather independent unit which contains its code and its
> dependencies. These dependencies will go into the required_modules.

Classes are complex animals in Python. I'd rather keep it simple, if
possible by any means.

> The less useful solution would be to have the import statements inside the reader code, like one
> would do in the map and reduce functions if they don't use the required_modules parameter.

Yep. Having to use import in map / reduce functions is somewhat kludgy.
The new map_init and reduce_init functions should solve that issue.

> Once we finalize our approach here I will make the necessary adjustments to the gzip reader.

Great. I'll keep you updated.

Ville

Eran Sandler

unread,

Aug 3, 2009, 3:36:02 AM8/3/09

to disc...@googlegroups.com

On Mon, Aug 3, 2009 at 5:44 AM, Ville Tuulos <tuu...@gmail.com> wrote:

Thanks. I'll merge it soon. I'm hindered by the fact that python2.6 is not
available in Debian (testing) yet.

Use Ubuntu 9.04 instead ;-)

I was thinking of the class approach as well. However, if it turns out to
be possible to infer module dependencies automatically, so there's no need
to supply a separate function to list them, a plain function should do
the job plus it's much simpler.

Fine by me.

Yep. Having to use import in map / reduce functions is somewhat kludgy.
The new map_init and reduce_init functions should solve that issue.

Yeah, that would be great. So when a map function is running instead of running the import code for every mappable item that is received from the reader, it will call the init when the worker starts the work.

> Once we finalize our approach here I will make the necessary adjustments to the gzip reader.

Great. I'll keep you updated.

Ping me once the chaining of readers is done and I'll commit the gzip (and bzip2) reader.

Eran

Reply all

Reply to author

Forward