how does disco take in input?


Carlo C.

Dec 23, 2009, 3:19:03 PM
to Disco-development
Hi,

Can someone explain how disco takes in input? The docs are a little
vague (http://discoproject.org/doc/py/core.html#disco.core.Job). If
you specify a url as input, does the master pull that url and
distribute chunks to the workers or does each worker pull a copy of
the input? Also, what's the disco:// protocol?

Paweł Reef

Dec 27, 2009, 6:19:34 AM
to disc...@googlegroups.com
One worker fetches the whole file from the URL. If you want multiple
workers to work on a single file, you need to split it yourself.
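
A minimal sketch of the splitting step, in case it helps. It is not part of Disco: `split_by_lines`, the `chunk-NNNN` file names, and the output directory are all hypothetical. It cuts a text file into at most `parts` pieces of roughly equal size, breaking only at line ends, so each piece can be given to the job as its own input.

```python
import os


def split_by_lines(path, parts, out_dir):
    """Split a text file into at most `parts` files, breaking only at line ends.

    Hypothetical helper, not a Disco API. Returns the chunk paths in order.
    """
    total = os.path.getsize(path)
    target = max(1, -(-total // parts))  # ceiling division: bytes per chunk
    os.makedirs(out_dir, exist_ok=True)
    outputs = []
    out = None
    written = 0
    with open(path, "rb") as src:
        for line in src:
            # Start a new chunk once the current one reached its target size,
            # unless the allowed number of chunks has already been opened.
            if out is None or (written >= target and len(outputs) < parts):
                if out is not None:
                    out.close()
                name = os.path.join(out_dir, "chunk-%04d" % len(outputs))
                out = open(name, "wb")
                outputs.append(name)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return outputs
```

Each resulting chunk then needs to be reachable at its own URL before it can be listed as a job input.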

Currently, disco://cnode03/bigtxt/file_name is an alias for
http://cnode03:[DISCO_PORT]/bigtxt/file_name. You could put a file
on one of the nodes and use it as input for a job on other nodes.
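
The alias described above can be illustrated with a small stand-alone function. This is only a sketch of the mapping as stated in this message, not Disco's own code; `disco_to_http` is a hypothetical name, and the port must be whatever DISCO_PORT is set to on your cluster.

```python
from urllib.parse import urlsplit


def disco_to_http(url, port):
    """Rewrite disco://host/path as http://host:port/path.

    Hypothetical helper; `port` stands in for the cluster's DISCO_PORT.
    URLs with any other scheme are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme != "disco":
        return url
    return "http://%s:%d%s" % (parts.netloc, port, parts.path)


# The port below is a placeholder value, not a statement about your setup.
print(disco_to_http("disco://cnode03/bigtxt/file_name", 8989))
```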

Luis Sisamon

Jan 17, 2010, 2:08:29 AM
to Disco-development
Would it be possible to modify Disco to, for example, read compressed
files?



Paweł Reef

Jan 20, 2010, 5:02:36 PM
to disc...@googlegroups.com
You could implement a map_reader capable of reading entries in any
format (compressed too).
See map_reader parameter here:
http://discoproject.org/doc/py/core.html#disco.core.Job
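
A rough sketch of what such a reader could look like for gzip-compressed, line-oriented input. The name `gz_lines_reader` and the `(fd, size, url, params)` argument list are assumptions made for illustration; check the Job documentation linked above for the arguments your Disco version actually passes to a map_reader.

```python
import gzip


def gz_lines_reader(fd, size=None, url=None, params=None):
    """Yield decoded text lines from a gzip-compressed binary stream.

    Hypothetical reader: the argument list is an assumption, and only
    `fd` (a file-like object opened in binary mode) is used here.
    """
    with gzip.GzipFile(fileobj=fd, mode="rb") as gz:
        for raw in gz:
            # Strip the trailing newline so each entry is one clean line.
            yield raw.decode("utf-8").rstrip("\n")
```

The same pattern should carry over to other formats: wrap the raw stream in the matching decompressor and yield one entry at a time.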

orac....@gmail.com

Jan 9, 2014, 6:09:30 AM
to disc...@googlegroups.com
I'm not sure I understand this correctly. In the word_count example, the input is a URL: http://discoproject.org/media/text/chekhov.txt

So the data is read in chunks using HTTP GET and passed directly to map_reader without being pushed to DDFS. Input file size is not a problem, because the file is read in chunks. This procedure reads the file over HTTP, line by line, the same way it would be read from DDFS.

Is this how Disco reads a file over HTTP?




orac....@gmail.com

Jan 9, 2014, 9:42:48 AM
to disc...@googlegroups.com
In the count_words.py example, Disco gets the following URL as input: http://discoproject.org/media/text/chekhov.txt

If I understand this correctly, Disco first makes an HTTP GET request with the file URL and gets a chunk of the file in response. The chunk is passed directly to the map_reader and is not stored in DDFS. This process repeats until the end of the file. Because Disco reads the file chunk by chunk, there is no upper limit on file size.


Jens Rantil

Jan 10, 2014, 6:02:45 PM
to disc...@googlegroups.com
Hi Orac,

The file is not read in chunks. It is read contiguously using a single GET request made by a single map instance.

This means that the job will not have any concurrency. If you'd like concurrency, give the job multiple HTTP input URLs.

I hope this answers your question,
Jens
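
To make the multiple-inputs suggestion concrete, here is a hedged sketch. It assumes the `Job().run(input=..., map=...)` form used in the count_words example and a file that has already been split into pieces; `chunk_urls`, `count_words_map`, `run_job`, and the chunk names are hypothetical, and the host and directory are borrowed from the disco:// example earlier in this thread.

```python
def chunk_urls(host, directory, names):
    """Build one disco:// input URL per pre-split chunk (hypothetical helper)."""
    return ["disco://%s/%s/%s" % (host, directory.strip("/"), n) for n in names]


def count_words_map(line, params):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in line.split():
        yield word, 1


def run_job(inputs):
    # Imported here so the helpers above can be tried without Disco installed.
    from disco.core import Job, result_iterator

    # Each entry in `inputs` is a separate input, so the map phase
    # can work on several of them at the same time.
    job = Job().run(input=inputs, map=count_words_map)
    return list(result_iterator(job.wait(show=True)))


inputs = chunk_urls("cnode03", "bigtxt", ["chunk-0000", "chunk-0001", "chunk-0002"])
print(inputs)
```

With a single URL in the list, the job falls back to the one-map-instance behaviour described above.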
