Running a job in the cluster

Harihara Vinayakaram

Aug 15, 2010, 8:54:13 AM

to disc...@googlegroups.com

Hi

I was able to set up Disco by following the tutorial, and I was able to run a job locally. Now I want to run jobs on a cluster. I followed the tutorial and ran distrfiles.py, which copied the split files to the various nodes. But now I get an error saying "http://hadoop3:8989/testdata/bigtxt-aw" Cannot access file. The file is present under the $DISCO_ROOT/data directory, but I cannot access it over HTTP.

How do I make it accessible?

Regards

Hari

Harihara Vinayakaram

Aug 15, 2010, 12:17:07 PM

to Disco-development

[map:28] Traceback (most recent call last):
  File "/usr/local/bin//disco-worker", line 19, in <module>
    main(sys.argv[5:], *sys.argv[1:5])
  File "/usr/local/bin//disco-worker", line 14, in main
    jobname=jobname)
  File "/usr/local/lib/python2.6/dist-packages/disco/node/worker.py", line 69, in __init__
    self.task.run()
  File "/usr/local/lib/python2.6/dist-packages/disco/task.py", line 246, in run
    self._run()
  File "/usr/local/lib/python2.6/dist-packages/disco/task.py", line 270, in _run
    reader, sze, url = self.connect_input(self.inputs[0])
  File "/usr/local/lib/python2.6/dist-packages/disco/task.py", line 200, in connect_input
    ret = input_stream(fd, sze, url, self.params)
  File "/usr/local/lib/python2.6/dist-packages/disco/func.py", line 457, in map_input_stream
    return mod.input_stream(stream, size, url, params)
  File "/usr/local/lib/python2.6/dist-packages/disco/schemes/scheme_disco.py", line 18, in input_stream
    return comm.open_remote('http://%s/%s/%s' % (netloc, prefix, fname))
  File "/usr/local/lib/python2.6/dist-packages/disco/comm.py", line 62, in open_remote
    conn = Conn(urlresolve(url))
  File "/usr/local/lib/python2.6/dist-packages/disco/comm.py", line 74, in __init__
    self.read(1)
  File "/usr/local/lib/python2.6/dist-packages/disco/comm.py", line 85, in read
    b = self.read_chunk(n)
  File "/usr/local/lib/python2.6/dist-packages/disco/comm.py", line 110, in read_chunk
    offset = (self.offset, end))
  File "/usr/local/lib/python2.6/dist-packages/disco/comm.py", line 41, in download
    code, body = real_download(urlresolve(url), **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/disco/comm_httplib.py", line 42, in download
    raise CommError("Transfer %s failed: %s" % (url, e), url)
CommError: Unable to access resource (http://hadoop3:8989/testdata/bigtxt-bl): Transfer http://hadoop3:8989/testdata/bigtxt-bl failed:

Harihara Vinayakaram

Aug 16, 2010, 12:39:12 PM

to Disco-development

Hi,
Are there any options I can proceed with? Thanks

Regards
Hari

Jared Flatow

Aug 16, 2010, 1:22:21 PM

to disc...@googlegroups.com

Hi Hari,

Where did you get that url from? Try http://hadoop3:8989/disco/bigtxt-aw if the file is directly in the $DISCO_ROOT/data directory.
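
One way to see which form of the URL a node actually serves is to request it directly. The following is a minimal sketch, assuming Python 3's urllib; the helper name can_fetch and its opener parameter are made up for illustration and are not part of Disco.

```python
import urllib.error
import urllib.request


def can_fetch(url, opener=urllib.request.urlopen, timeout=5):
    """Return True if a GET on url answers with HTTP 200, else False.

    `opener` is a hypothetical hook so the check can be exercised
    without a live cluster; by default it is urllib's urlopen.
    """
    try:
        resp = opener(url, timeout=timeout)
        try:
            return resp.getcode() == 200
        finally:
            resp.close()
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors (HTTPError subclasses URLError),
        # refused connections, and unresolvable host names.
        return False


# Example calls against the two URLs from this thread (need a live node):
#   can_fetch("http://hadoop3:8989/testdata/bigtxt-aw")
#   can_fetch("http://hadoop3:8989/disco/bigtxt-aw")
```

Running both calls from a worker node shows which path prefix the node's HTTP server accepts.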

The recommended solution over using distrfiles.py would be to push the files to ddfs and use the tag as an input for your jobs.

jared

Harihara Vinayakaram

Aug 16, 2010, 8:54:49 PM

to disc...@googlegroups.com

Thanks Jared.
I followed the tutorial, which says:
1) use distrfiles.py >x
2) python test1.py disco://hadoop1 `cat /tmp/x`

I was able to access the files using the URL you mentioned. Should the tutorial be revamped to use ddfs? I can do that if I know where to start.

Regards
Hari

Chris Mueller

Aug 17, 2010, 11:20:35 AM

to Disco-development

I just ran into the same problem that Hari ran into. Here's what I was
able to figure out.

In the tutorial examples, the URLs generated by distrfiles.py are of
the form:
disco://compute-0-3/bigfile/bigfile_aa

Disco converts them to:
http://compute-0-3/bigfile/bigfile_aa
before requesting the file from a remote node.

Based on Jared's note and some experimentation, it looks like they
should be converted to:
http://compute-0-3/disco/bigfile/bigfile_aa

Indeed, this does work. I can telnet into any node and access files
using URLs formatted this way.
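
The rewrite described above can be sketched as follows. This is only an illustration of the URL form that worked in this thread, not Disco's own conversion code; the function name disco_to_http and its port and prefix parameters are hypothetical.

```python
from urllib.parse import urlsplit


def disco_to_http(url, port=None, prefix="disco"):
    """Rewrite disco://host/path as http://host[:port]/prefix/path.

    Hypothetical helper: it mirrors the URL shape that was reachable
    in this thread and makes no claim about Disco's internals.
    """
    parts = urlsplit(url)
    if parts.scheme != "disco":
        raise ValueError("not a disco:// URL: %r" % url)
    host = parts.netloc if port is None else "%s:%d" % (parts.netloc, port)
    return "http://%s/%s/%s" % (host, prefix, parts.path.lstrip("/"))


print(disco_to_http("disco://compute-0-3/bigfile/bigfile_aa"))
# prints http://compute-0-3/disco/bigfile/bigfile_aa
```

Passing port=8989 gives the form used on Hari's nodes, for example http://hadoop3:8989/disco/bigtxt-aw.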

Going through the code, however, it looks like internal disco urls
always have to be of the form:
http://compute-0-3/key/value

and that it's not possible to do:
http://compute-0-3/disco/key/value

lib/disco/task.py:276 seems to be the code that prevents it by
splitting file requests into a key and a value. Three values break the
code.
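
As a rough illustration of that failure mode: the snippet below is not the code at task.py:276, only a hypothetical two-way split (split_key_value is an invented name) showing why a path with three components cannot be unpacked into a key and a value.

```python
def split_key_value(path):
    """Split '/key/value' into exactly two parts.

    Hypothetical stand-in for a strict two-component split; it raises
    ValueError when the path has more or fewer than two components.
    """
    key, value = path.strip("/").split("/")
    return key, value


print(split_key_value("/bigfile/bigfile_aa"))
# prints ('bigfile', 'bigfile_aa')

try:
    split_key_value("/disco/bigfile/bigfile_aa")
except ValueError as err:
    # Three components cannot be unpacked into two names.
    print("three components fail:", err)
```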

So... given that, is it possible to access files in the data directory
using the input argument on new_jobs? If so, what is the input
format? This seems to be the sticking point with the tutorial code -
the format output by distrfiles.py isn't converted into proper
requests for disco data.

I'm going to try ddfs today and see if it avoids the problem. But
for completeness, it would be nice to know how to access files in /data.

Thanks!

-Chris

Harihara Vinayakaram

Aug 19, 2010, 1:33:04 AM

to disc...@googlegroups.com

Hi

ddfs avoids the problem. The URLs generated by ddfs.get look like

disco://hadoop4/ddfs/vol0/blob/89/nohup_out$502-2e36f-e8f67

So I guess URLs whose path starts with /disco access the file system, and those starting with /ddfs access the DDFS data.
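
That guess can be written down as a small check on the first path component. This is a sketch of the observation in this thread only; the function name data_area and the labels it returns are invented here and say nothing about how Disco routes requests internally.

```python
from urllib.parse import urlsplit


def data_area(url):
    """Guess which storage area a URL points at from its first path part.

    Hypothetical helper based on the /disco vs /ddfs observation above.
    """
    first = urlsplit(url).path.lstrip("/").split("/", 1)[0]
    return {"disco": "data directory", "ddfs": "DDFS"}.get(first, "unknown")


print(data_area("disco://hadoop4/ddfs/vol0/blob/89/nohup_out$502-2e36f-e8f67"))
# prints DDFS
print(data_area("http://hadoop3:8989/disco/bigtxt-aw"))
# prints data directory
```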

Regards

Hari
