tags and URLs


Pavel Hančar

May 18, 2013, 11:36:42 AM
to disc...@googlegroups.com
 Hello,
I have a tag named "test" with data stored through ddfs chunk. This works:
$ ddfs xcat test

but,
$ ddfs urls test
disco://nymfe24/ddfs/vol0/blob/35/al_1000MB-0$558-d89c0-83c9d   disco://nymfe41/ddfs/vol0/blob/35/al_1000MB-0$558-d89c0-83c9d
...
$ ddfs tag newtag disco://nymfe24/ddfs/vol0/blob/35/al_1000MB-0$558-d89c0-83c9d
$ ddfs xcat newtag
Unable to access resource ([u'disco://nymfe24/ddfs/vol0/blob/35/al_1000MB-058-d89c0-83c9d']): Exhausted all available replicas, last error was:

Traceback (most recent call last):
  File "/packages/run.64/disco-0.4.4/lib/disco/worker/__init__.py", line 470, in swap
    self.iter = skip(self.open(next(self.urls)), self.last + 1)
  File "/packages/run.64/disco-0.4.4/lib/disco/worker/classic/worker.py", line 376, in open
    return ClassicFile(url, streams, params)
  File "/packages/run.64/disco-0.4.4/lib/disco/worker/classic/worker.py", line 384, in __init__
    fd = stream(fd, size, url, *maybe_params)
  File "/packages/run.64/disco-0.4.4/lib/disco/worker/classic/func.py", line 425, in task_input_stream
    return schemes.input_stream(stream, size, url, params, globals=globals())
  File "/packages/run.64/disco-0.4.4/lib/disco/schemes/__init__.py", line 35, in input_stream
    return input_stream(stream, size, url, params)
  File "/packages/run.64/disco-0.4.4/lib/disco/schemes/scheme_disco.py", line 19, in input_stream
    file = open(url, task=globals().get('Task'))
  File "/packages/run.64/disco-0.4.4/lib/disco/schemes/scheme_disco.py", line 13, in open
    return comm.open_url(util.urljoin((scheme, netloc, path)))
  File "/packages/run.64/disco-0.4.4/lib/disco/comm.py", line 104, in open_url
    return open_remote(url, *args, **kwargs)
  File "/packages/run.64/disco-0.4.4/lib/disco/comm.py", line 110, in open_remote
    return Connection(urlresolve(url), token)
  File "/packages/run.64/disco-0.4.4/lib/disco/comm.py", line 146, in __init__
    self.read(1)
  File "/packages/run.64/disco-0.4.4/lib/disco/comm.py", line 172, in read
    bytes = self._read_chunk(size if size > 0 else CHUNK_SIZE)
  File "/packages/run.64/disco-0.4.4/lib/disco/comm.py", line 192, in _read_chunk
    headers=headers)
  File "/packages/run.64/disco-0.4.4/lib/disco/comm.py", line 80, in request
    raise CommError(response.read(), url, status)
CommError: Unable to access resource (http://nymfe24:8989/ddfs/vol0/blob/35/al_1000MB-058-d89c0-83c9d): Not found. (404)

What's the problem? Did I misunderstand something?
 Pavel Hančar

Pavel Hančar

May 18, 2013, 11:48:31 AM
to disc...@googlegroups.com
I forgot to say: my goal is to see the content of a single URL. I wrote a reader for ddfs chunk -R ... and I want to check whether it works properly.
  Pavel



Prashanth Mundkur

May 18, 2013, 2:43:28 PM
to disc...@googlegroups.com
On 17:48 Sat 18 May, Pavel Hančar wrote:
> I forgot to say: my goal is to see the content of one URL. I wrote a reader
> for ddfs chunk -R ... and I want to see, if it works properly.

IIRC, you will need to use the same reader in xcat, also specified via '-R'.

--prashanth

Pavel Hančar

May 18, 2013, 7:08:07 PM
to disc...@googlegroups.com
Thanks, but that is probably not the issue. The reader I wrote yields chunks of text that run from one XML tag to the next (<doc ...>...</doc>). I'd like to check that it works, which means (I hope) that the content of every URL (disco://...) has the form <doc ...>...</doc>.
  My second question is what I misunderstood about the ddfs urls and ddfs tag commands. This does not depend on the reader, because (as you can see above) the "Not found. (404)" error appeared without any -R argument.






Pavel Hančar

May 20, 2013, 7:40:03 PM
to disc...@googlegroups.com
Or how should I write a reader that does not split "<doc>...</doc>" elements? I tried this:

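def to_doc_reader(fd, size, url, params):
    # Start the first record with the first line of the input.
    text = fd.readline()
    while True:
        line = fd.readline()
        if line == '':
            # End of input: leave the loop and emit the last record.
            break
        i = line.find("<doc")
        if i == -1:
            # No new element starts on this line; it belongs to the current record.
            text += line
        else:
            # A new element starts at column i: rewind the file to that point,
            # finish the current record, and start the next one from "<doc".
            fd.seek(-len(line) + i, 1)
            text += line[:i]
            yield text
            text = fd.readline()
    yield text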

But it seems to slow down the upload a lot, and I don't know whether it is correct (as I wrote before).
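One way to check a reader like this without uploading anything is to run it over an in-memory file and inspect what it yields. The following is only a sketch under some assumptions: it is a bytes-based copy of the reader for Python 3 (the original was written for the Python 2 of Disco 0.4.4), the sample data is made up, and the url argument "sample://in-memory" is a placeholder that the reader never looks at.

```python
import io

def to_doc_reader(fd, size, url, params):
    """Bytes-based copy of the reader, for a local check on Python 3."""
    text = fd.readline()
    while True:
        line = fd.readline()
        if line == b"":
            break
        i = line.find(b"<doc")
        if i == -1:
            text += line
        else:
            # Rewind to the start of "<doc" so the next record begins there.
            fd.seek(-len(line) + i, 1)
            text += line[:i]
            yield text
            text = fd.readline()
    yield text

# Made-up sample: three elements, the third one starting in the middle of a line.
sample = (b'<doc id="1">\nfirst\n</doc>\n'
          b'<doc id="2">\nsecond\n</doc><doc id="3">\nthird\n</doc>\n')

records = list(to_doc_reader(io.BytesIO(sample), len(sample), "sample://in-memory", None))
for record in records:
    print(repr(record))
```

If every yielded record starts with "<doc" and the records concatenate back to the input, the reader neither splits nor loses elements on this sample.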
Pavel



Pavel Hančar

May 21, 2013, 12:39:06 PM
to disc...@googlegroups.com
So,
the answer is: the dollar signs in the URLs must be escaped! :)
The reader is correct and, I apologize, it doesn't slow down the upload. I was only surprised that ddfs chunk is so much slower than hadoop dfs -put.
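For anyone hitting the same 404: in the first message the error shows al_1000MB-058-... where the URL had al_1000MB-0$558-..., because the shell treated $5 as an (empty) positional parameter. When the ddfs command line is generated from a script, quoting each URL avoids this. A minimal sketch, assuming a POSIX shell; tag_command is a hypothetical helper, not part of Disco:

```python
import shlex

# URL as printed by `ddfs urls test` in the first message of this thread.
url = "disco://nymfe24/ddfs/vol0/blob/35/al_1000MB-0$558-d89c0-83c9d"

def tag_command(tag, urls):
    """Build a `ddfs tag` command line that is safe to paste into a POSIX shell."""
    parts = ["ddfs", "tag", shlex.quote(tag)] + [shlex.quote(u) for u in urls]
    return " ".join(parts)

# The URL ends up inside single quotes, so the shell leaves the "$" alone.
print(tag_command("newtag", [url]))
```

Typing the URL by hand inside single quotes, or writing \$ instead of $, has the same effect.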
 Best Wishes,
  Pavel



Prashanth Mundkur

May 22, 2013, 5:11:36 PM
to disc...@googlegroups.com
On 18:39 Tue 21 May, Pavel Hančar wrote:

> So, the answer is: The dollar signs in URLs must be escaped! :) The
> reader is correct and I appologize, it doesn't slow down the upload.
> Only I was surprised, the ddfs chunk is so slower than hadoop dfs
> -put.

Note that ddfs chunk is providing you the ability to chunk on record
boundaries that you define; hadoop dfs -put doesn't provide that. The
cost you pay is doing your record parsing in Python during upload,
which slows it down. However, this is a one time cost; once uploaded,
your data can be processed multiple times w/o dealing with record
splitting issues, unlike hadoop.

--prashanth

Pavel Hančar

May 25, 2013, 1:13:38 PM
to disc...@googlegroups.com
Thank you for the answer,
but I am curious: who writes the second and third replicas in Disco? Does that happen in parallel, as in Hadoop's replica pipelining?
  Pavel



Prashanth Mundkur

May 26, 2013, 12:09:40 PM
to disc...@googlegroups.com
On 19:13 Sat 25 May, Pavel Hančar wrote:
> Thank you for the answer,
> but I am curious, who puts the second and third replicas in Disco? Does it
> run in parallel as in case of hadoop replica pipelining?

At the time of upload, the client (i.e. the 'ddfs push' command)
itself creates the additional replicas, in parallel. This is unlike
hadoop replica pipelining, where afaik the client uploads one replica
and the hadoop system creates the others.
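
To make the difference concrete, here is a toy sketch of client-side replication: the client sends the same blob to every target at once. It is not Disco's code. The node names, the in-memory "storage", and both function names are made up for illustration, and a real client would send the data over the network where put_replica just stores it in a dict.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for storage nodes; not Disco's data structures.
nodes = {"node-a": {}, "node-b": {}, "node-c": {}}

def put_replica(node_name, blob_name, data):
    """Pretend to upload one replica to one node."""
    nodes[node_name][blob_name] = data
    return node_name

def push_with_replicas(blob_name, data, targets):
    """Client-side replication: send the same blob to every target concurrently."""
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        futures = [pool.submit(put_replica, t, blob_name, data) for t in targets]
        return sorted(f.result() for f in futures)

stored_on = push_with_replicas("blob-0", b"<doc>...</doc>\n", ["node-a", "node-b", "node-c"])
print(stored_on)
```

With pipelining, by contrast, the client would make only one of these uploads and the storage nodes would forward the data among themselves.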

--prashanth

Jens Rantil

May 26, 2013, 5:47:22 PM
to disc...@googlegroups.com
However, it should also be noted that the master will make sure that
the number of replicas eventually converges to the value set in the
master configuration.

Jens


Pavel Hančar

Jun 1, 2013, 7:55:00 AM
to disc...@googlegroups.com
 That's what I meant: the Hadoop client places only one replica and the others are placed in parallel by the slaves. The Disco client instead places all three replicas itself.
  I tried to compare the upload speed of Disco and Hadoop for my master's thesis, and these are my results:
[Inserted image 2]
I emphasize that the time axis is log-scaled, which means DDFS is 10x slower than HDFS. Thank you, Prashanth, for the explanation about record awareness, but I think 10x is too much. BDFS is my own simple script (no replicas, no master/slave architecture), but its upload is also record-aware.
  Or do you think I made some stupid mistake to get such results? If not, I have two hypotheses for why DDFS is so much slower, but I don't know its internals well enough to be sure. The first is the replica placement mentioned above. The second is that I suspect Disco of reading and parsing the whole input file to find record borders. A quicker way to get record awareness is probably to find block borders by skipping, e.g., 64MB with seek, reading only up to the first record border, and skipping again. The blocks found this way can then be sent without parsing, and that's what BDFS does.
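  The seek-and-scan idea can be sketched roughly as follows. This is not the actual BDFS script, only an illustration under assumptions: the function name, the probe size, and the use of "<doc" as the record border are all chosen for the example.

```python
import io

def record_aligned_offsets(fd, file_size, target_chunk, marker=b"<doc", probe=4096):
    """Return chunk start offsets; every chunk after the first begins at `marker`."""
    if target_chunk <= 0:
        raise ValueError("target_chunk must be positive")
    offsets = [0]
    pos = target_chunk
    keep = len(marker) - 1  # bytes to retain so a marker split across probes is seen
    while pos < file_size:
        fd.seek(pos)  # skip ahead without reading the bytes in between
        base, window, found = pos, b"", -1
        while found == -1:
            block = fd.read(probe)
            if not block:
                break  # end of file reached without another record border
            window += block
            found = window.find(marker)
            if found == -1:
                # Drop what was scanned, except a tail that may hold half a marker.
                drop = max(len(window) - keep, 0)
                base += drop
                window = window[drop:]
        if found == -1:
            break
        offsets.append(base + found)
        pos = offsets[-1] + target_chunk
    return offsets

# Made-up input: ten 16-byte records, cut into chunks of roughly 40 bytes.
record = b"<doc>aaaa</doc>\n"
data = record * 10
offsets = record_aligned_offsets(io.BytesIO(data), len(data), target_chunk=40, probe=8)
print(offsets)
```

  Only a few bytes after each skip are actually read, so the cost depends on the number of chunks and not on the size of the file.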
  So, what do you think?
  Pavel
 



upload_graph.png

Prashanth Mundkur

Jun 1, 2013, 7:42:58 PM
to disc...@googlegroups.com
Thanks Pavel for sharing your results!

On 13:55 Sat 01 Jun, Pavel Hančar wrote:

> I emphasize the time is log scaled. It means DDFS is 10x slower than
> HDFS. Thank you Prashanth for the explanation about record
> awareness, but I think 10x is too much. The BDFS is my simple script
> (no replicas, no master/slave architecture), but it has also record
> aware upload.

If I understand correctly, here 'DDFS chunk' refers to the use of the
chunk reader you'd posted earlier. It would also be useful to add
non-record-aware 'DDFS chunk' to the graph; that would be a fairer
comparison to the HDFS upload. I expect it to still be much slower
due to python, but it would still be a fairer (apples/apples)
comparison.

> Or do you think I've done some stupid mistake to have such
> results? If not, I have two hypotheses why DDFS is so slower, but I
> don't know it properly, so I can't be sure. The first is the replica
> placement mentioned above. The second is that I suspect Disco of
> reading and parsing all the input file for record borders. The
> quicker way for record awareness is probably to find block borders
> by skipping e.g. 64MB with seek, reading only to the first record
> border and skipping again. Then you can send the found blocks
> without parsing and that's what BDFS does.

Can you ensure that DDFS is using pycurl for the upload?

$ python
>>> import pycurl

If it is not, could you measure (for future reference) what
performance difference using pycurl makes versus not using pycurl?

Does this mean that your BDFS script uses a different block boundary
algorithm than the DDFS chunk script's reader? IIRC, that reader just
did a string.find(), which is quite inefficient in Python. You might
be able to speed it up by using the Python regex library. It would be
best to ensure that your BDFS and your DDFS reader use the same record
boundary finding algorithm.

I'm not sure what 'BDFS parallel' means.

Thanks,

--prashanth

Prashanth Mundkur

Jun 1, 2013, 7:55:04 PM
to disc...@googlegroups.com
On 16:42 Sat 01 Jun, Prashanth Mundkur wrote:

> IIRC, that reader just did a string.find(), which is quite
> inefficient in Python.

I'm wrong about this:

http://svn.python.org/view/python/trunk/Objects/stringlib/fastsearch.h?view=markup&pathrev=77470

indicates you might not be able to do better using the regex library.
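
A quick way to check this on a given machine is to time both approaches on the same input. This is only a rough sketch; the haystack is made up, and the numbers will vary with the Python version and the data.

```python
import re
import timeit

# Made-up input: many lines without the marker, then one line that has it.
haystack = b"some text without any marker\n" * 20000 + b'<doc id="42">\n'
needle = b"<doc"
pattern = re.compile(re.escape(needle))

def with_find():
    """Locate the marker with the built-in substring search."""
    return haystack.find(needle)

def with_regex():
    """Locate the marker with a precompiled regular expression."""
    match = pattern.search(haystack)
    return match.start() if match else -1

# Both approaches must agree on the position before timing them.
assert with_find() == with_regex()

for name, fn in (("bytes.find", with_find), ("re.search", with_regex)):
    seconds = timeit.timeit(fn, number=20)
    print("%-10s %.4f s for 20 scans" % (name, seconds))
```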

--prashanth

Prashanth Mundkur

Jun 1, 2013, 9:42:21 PM
to disc...@googlegroups.com
On 13:55 Sat 01 Jun, Pavel Hančar wrote:

> If not, I have two hypotheses why DDFS is so slower, but I don't
> know it properly, so I can't be sure. The first is the replica
> placement mentioned above.

I doubt this will be an issue, especially if you use pycurl.

> The second is that I suspect Disco of reading and parsing all the
> input file for record borders.

Indeed, this is a major issue, since by default Disco tries to chunk
on '\n' boundaries even if you don't specify a reader.

You have two options to regain performance:

One is a custom chunker, which you can plug in here:
https://github.com/discoproject/disco/blob/master/lib/disco/ddfs.py#L139
Unfortunately, the Disco input/output stream code is not well
documented, and quite tightly coupled in various places, so hard to
modify.

> The quicker way for record awareness is probably to find block
> borders by skipping e.g. 64MB with seek, reading only to the first
> record border and skipping again. Then you can send the found blocks
> without parsing and that's what BDFS does. So, what do you think?

The other much easier option is to modify your above BDFS to directly
talk to DDFS, using DDFS's REST API:
https://disco.readthedocs.org/en/latest/howto/ddfs.html#web-api

Again, for a fair comparison, you might need to compare with HDFS's
web API:
http://hadoop.apache.org/docs/r1.0.4/webhdfs.html#CREATE

--prashanth

Pavel Hančar

Jun 2, 2013, 6:10:15 PM
to disc...@googlegroups.com
 Thank you for the answers.
I am not sure about pycurl. My master node can import pycurl, but not all of the slaves can. Does that mean I didn't use pycurl?
  Just to be clear: BDFS is a bash script. At the beginning it runs a Python script that computes the record-aware offsets of the chunks. Then "BDFS sequential"
runs something like cat $INPUT_FILE | for node in $NODES; do ssh $node ...; done
  while "BDFS parallel" runs parallel processes on the chunks, something like
for node in $NODES; do tail -c+$from $INPUT_FILE | head -c $size | ssh $node ... & done
    BDFS sequential is slower because of the sequential ssh logins.
    Sorry for so many details, but now you can fully understand the graph.
    And yes, with this approach I didn't use the reader mentioned above, because I don't need to search the whole input file.
    Pavel




Prashanth Mundkur

Jun 4, 2013, 2:07:10 AM
to disc...@googlegroups.com
On 00:10 Mon 03 Jun, Pavel Hančar wrote:
> Thank you for the answers,

> I am not sure about the pycurl. My master node can import pycurl,
> but not all the slaves. Does it mean I didn't use pycurl?

For this experiment, you'd need it only at the client, i.e. the
machine that is running your scripts.

> Just to be clear: BDFS is a bash script. At the beginning it runs a
> python script computing the record aware offsets of chunks. Then the "BDFS
> sequential"
> runs something like cat $INPUT_FILE | for node in $NODES; do ssh node ...;
> done
> while "BDFS parallel" runs parallel processes on the chunks; something
> like
> for node in $NODES; do tail -c+$from $INPUT_FILE | head -c $size | ssh
> $node ... & done
> BDFS sequential is slower by sequential ssh login.
> Sorry for so much details, but now you can fully understand the graph.
> And yes, with this attitude I didn't use the reader mentioned above,
> because I don't need to search all the input file.

Thanks for the explanation. One option could be to implement your
bash script in python, and then instead of calling ssh $node, just use
http://discoproject.org/doc/disco/lib/ddfs.html#disco.ddfs.DDFS.push
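
A rough sketch of that route, under assumptions: the chunk offsets are already known, each chunk fits in memory, and Disco is installed and configured on the client. write_chunk_files and push_chunks are hypothetical helper names, and the call DDFS().push(tag, files) follows the documentation linked above, so check it against your Disco version before relying on it.

```python
import os
import tempfile

def write_chunk_files(path, offsets, out_dir):
    """Copy the byte ranges between consecutive offsets of `path` into separate files."""
    bounds = list(offsets) + [os.path.getsize(path)]
    chunk_paths = []
    with open(path, "rb") as src:
        for k in range(len(offsets)):
            src.seek(bounds[k])
            chunk_path = os.path.join(out_dir, "chunk-%05d" % k)
            with open(chunk_path, "wb") as dst:
                # Each chunk is read into memory in one go.
                dst.write(src.read(bounds[k + 1] - bounds[k]))
            chunk_paths.append(chunk_path)
    return chunk_paths

def push_chunks(tag, chunk_paths):
    """Push pre-cut chunk files as blobs under one tag (needs a working Disco setup)."""
    from disco.ddfs import DDFS  # imported lazily; assumed to be available on the client
    return DDFS().push(tag, chunk_paths)

# Made-up input: three 13-byte records, cut at their record borders.
data = b"<doc>a</doc>\n<doc>b</doc>\n<doc>c</doc>\n"
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, "input.xml")
with open(source, "wb") as f:
    f.write(data)

chunk_paths = write_chunk_files(source, [0, 13, 26], work_dir)
print(chunk_paths)
# push_chunks("newtag", chunk_paths)  # uncomment on a machine where Disco is configured
```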

--prashanth

Pavel Hančar

Jun 5, 2013, 9:13:00 AM
to disc...@googlegroups.com
 Thanks,
I ran it on the master, so the measurements shown in the graph did use pycurl.
 Pavel




Pavel Hančar

Nov 27, 2013, 3:47:29 PM
to disc...@googlegroups.com
 At last I have come back to this issue. Above, Prashanth mentioned a "non-record-aware 'DDFS chunk'". Is that actually possible? If I understand correctly, the chunker is always record-aware; it seems to default to line records.
  Thanks,
  Pavel

