GSoC Disco project


Georg Göri

Mar 6, 2015, 5:03:52 PM
to beam-co...@googlegroups.com
Hi.

I am a computer science student from Austria.
I started using Erlang as part of my bachelor's thesis, and have also
based my (in-progress) master's thesis on Erlang concepts (for details,
read the paper [1], unfortunately behind a paywall).

I've been using mostly open source software for the last 8 years, and
have always wanted to contribute more intensively to a project.
For some time I was part of the OpenStreetMap community in my home town,
but now I want to contribute in a more technical way.

I am very interested in distributed servers, so I would like to work on
such a project. As a freelancer I have already developed a
geospatial-distributed game server in Erlang. Therefore I would like to
work on the Disco project as a GSoC student. To get a better
understanding of and feel for the code base, I am also working on an
issue in DDFS [2].

I find the following ideas very interesting:

* Add namespaces to DDFS.
I assume that would require changes all over the DDFS code.
Just to clarify: The namespaces should isolate the tags, so that one tag
can only refer to other tags in the same namespace?

* Network-resource-aware task allocator
The description states that there is already a naive approach to
network awareness, and I wonder where I could find the corresponding code.
I also wanted to ask whether there are already any ideas on how the
network topology can be extracted.
I thought that measuring the hops via traceroute, or the bandwidth and
latency via a sample transmission (or, fancier, adjusting during
operation?), could work.

I hope my questions are not misplaced here.


Cheers

Georg





[1] http://dx.doi.org/10.1007/978-3-662-45231-8_2
[2] https://github.com/discoproject/disco/issues/603

Shayan Pooya

Mar 7, 2015, 1:17:39 AM
to Georg Göri, beam-co...@googlegroups.com
Hello Georg, and welcome to the beam community!

> I assume that would require changes all over the DDFS code.
A lot of the changes are in the form of replacing a single process
with a set of processes (one for each namespace). For example, it no
longer makes sense to have a single process registered as ddfs_gc.
Instead there should be a process like ddfs_gc_dict that maps a
namespace to a process. Fortunately, Erlang is a great language for
such things!
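
To illustrate the idea of replacing one registered process with a
namespace-to-process lookup, here is a minimal sketch. It is written in
Python rather than Erlang purely for illustration, and the names
(NamespaceRegistry, lookup, the factory argument) are hypothetical; none
of them are part of Disco.

```python
import threading


class NamespaceRegistry:
    """Maps a namespace name to its own worker, creating it on first use.

    Hypothetical stand-in for something like ddfs_gc_dict: instead of one
    globally registered worker, callers ask the registry for the worker
    that belongs to a given namespace.
    """

    def __init__(self, factory):
        self._factory = factory      # called as factory(namespace)
        self._workers = {}
        self._lock = threading.Lock()

    def lookup(self, namespace):
        # The lock ensures two concurrent callers never create two
        # workers for the same namespace.
        with self._lock:
            if namespace not in self._workers:
                self._workers[namespace] = self._factory(namespace)
            return self._workers[namespace]


registry = NamespaceRegistry(lambda ns: {"namespace": ns})
gc_for_default = registry.lookup("default")
```

Repeated lookups for the same namespace return the same worker, while
different namespaces get independent ones.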

> The namespaces should isolate the tags, so that one tag can only refer to other tags in the same namespace?
Correct. One of the main incentives behind this project is being able
to run multiple independent garbage collections -- one for each
namespace. This will require that each tag only points to the other
tags in the same namespace.

One of the good-to-have requirements of this project is backward
compatibility, meaning that it should be possible to upgrade from old
versions of Disco to Disco-with-namespaces. This requirement can
probably be relaxed if it is too hard to achieve.


> The description states that there is already a naive approach to network awareness, and I wonder where I could find the corresponding code.

Currently, Disco will push compute to the data whenever possible.
This means that if a node has a replica of a blob, the Disco scheduler
will do its best to schedule the task on that node.
Take a look at job_coordinator:do_submit_tasks_in and
fair_scheduler_job:assign_task
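
The "push compute to the data" policy described above can be sketched
roughly as follows. This is a simplified illustration in Python, not the
logic of the Erlang functions named above; pick_node and its arguments
are hypothetical.

```python
def pick_node(replica_nodes, free_slots):
    """Return a node with a free slot, preferring nodes holding a replica.

    replica_nodes: nodes that store a replica of the task's input blob.
    free_slots:    dict mapping node name -> number of free task slots.
    Returns None if no node currently has a free slot.
    """
    # First choice: a node that already has the data locally.
    for node in replica_nodes:
        if free_slots.get(node, 0) > 0:
            return node
    # Fallback: any node with capacity; the input must then be fetched
    # over the network.
    for node, slots in free_slots.items():
        if slots > 0:
            return node
    return None


chosen = pick_node(["n2", "n3"], {"n1": 4, "n2": 0, "n3": 1})
```

In this example n2 holds a replica but is busy, so the task goes to n3,
which also holds one, rather than to the idle but remote n1.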

> I thought that measuring the hops via traceroute, or the bandwidth and latency via a sample transmission (or, fancier, adjusting during operation?), could work.

For this project, we are looking for a way for the administrator to
pre-set the rack names for the nodes. For example, when adding slave
nodes, another field will be added to show the rack. Another approach
would be to accept a script that runs on each node and returns the
rack name of that node. Another good-to-have feature is a hierarchy of
racks.
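
One way a hierarchy of racks could feed into scheduling is as a distance
between node locations. The sketch below assumes, purely for
illustration, that each node is described by a slash-separated path such
as /datacenter/rack/node; that format and the function names are
assumptions, not something Disco defines.

```python
def rack_distance(path_a, path_b):
    """Count the hops from each location up to their closest common ancestor."""
    a = [part for part in path_a.split("/") if part]
    b = [part for part in path_b.split("/") if part]
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def closest_nodes(target, candidates):
    """Order candidate nodes from nearest to farthest relative to target."""
    return sorted(candidates, key=lambda c: rack_distance(target, c))


ordered = closest_nodes(
    "/dc1/rack1/n1",
    ["/dc2/rack9/n7", "/dc1/rack2/n3", "/dc1/rack1/n2"],
)
```

With these paths, a node in the same rack is at distance 2, a node in
another rack of the same data center at distance 4, and a node in another
data center at distance 6.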

This issue is documented for Hadoop in this JIRA ticket [1].

> I hope my questions are not misplaced here.

There is no better place to ask your questions!

[1]: https://issues.apache.org/jira/browse/HADOOP-692


Regards.

Georg Göri

Mar 18, 2015, 10:17:54 AM
to beam-co...@googlegroups.com
I am now working further on the concept for namespaces and have a few
questions regarding the intended functionality.

I thought it would be a good idea to let the namespace be a part of the
tag, separated by a character that is not allowed to be part of a
tag name (slashes would be a good choice, but there is this automagic on
the REST interface that converts the path a/b/c to a:b:c), so I now
propose to use a pipe ("|").
This way most of the ddfs commands do not have to change.
The ddfs ls command could search a namespace other than the default
one by specifying a "<namespace-name>|" prefix.
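
A minimal sketch of how such a prefixed tag could be split, written in
Python for illustration. The constant names, the name "default" for the
default namespace, and the choice to split at the last separator are all
assumptions made for this example.

```python
SEPARATOR = "|"                # proposed separator; not valid in tag names
DEFAULT_NAMESPACE = "default"  # hypothetical name for the default namespace


def split_tag(qualified):
    """Split "<namespace>|<tag>" into (namespace, tag).

    A tag without a separator belongs to the default namespace. Splitting
    at the last separator keeps the tag name free of "|" even if a
    namespace name is allowed to contain it.
    """
    namespace, sep, name = qualified.rpartition(SEPARATOR)
    if not sep:
        return DEFAULT_NAMESPACE, name
    return namespace, name


print(split_tag("data:logs"))     # tag in the default namespace
print(split_tag("project|data"))  # tag "data" in namespace "project"
print(split_tag("project|"))      # bare prefix, e.g. for listing a namespace
```

Existing unprefixed tags keep working unchanged, which is what lets most
commands stay as they are.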

Further I think a command like list-namespaces is needed, that would
return all existing namespaces.

Now to the questions.
In the project idea it was stated that one might want hierarchical
namespaces. I don't think that would be very hard to implement, but I
see no gain in making them hierarchical when I could also have
the namespaces "a|b" and "a|b|c" (assuming | is legal in a namespace's
name), which would be totally isolated but look hierarchical.
Is there a disadvantage, or are there any features, planned or
overlooked by me, that could make use of a real hierarchical structure
(besides perhaps listing namespace names)?

Further, I thought of proposing a GC concept in which only one
namespace can be GC'ed at a time overall, to keep the load caused by
GC'ing under control. Any opinions on that?



Cheers

Georg

Shayan Pooya

Mar 18, 2015, 10:48:34 AM
to Georg Göri, beam-co...@googlegroups.com
> I thought it would be a good idea to let the namespace be a part of the
> tag, separated by a character that is not allowed to be part of a
> tag name (slashes would be a good choice, but there is this automagic on
> the REST interface that converts the path a/b/c to a:b:c), so I now
> propose to use a pipe ("|").
> This way most of the ddfs commands do not have to change.
> The ddfs ls command could search a namespace other than the default
> one by specifying a "<namespace-name>|" prefix.

Sounds good. The pipe character has a special meaning in the shell,
though, so there might be better choices. For example, the REST API
could be modified not to convert '/' into ':'.


> In the project idea it was stated that one might want hierarchical
> namespaces. I don't think that would be very hard to implement, but I
> see no gain in making them hierarchical when I could also have
> the namespaces "a|b" and "a|b|c" (assuming | is legal in a namespace's
> name), which would be totally isolated but look hierarchical.

Hierarchy is not a must-have for this project.
As for use cases, I was thinking about writing wrappers around DDFS
namespaces that allow them to be mounted under FUSE or other
distributed filesystems.


> Is there a disadvantage, or are there any features, planned or
> overlooked by me, that could make use of a real hierarchical structure
> (besides perhaps listing namespace names)?

I assume DDFS without hierarchical namespaces can later be extended to
hierarchical namespaces. So definitely avoid hierarchy if it makes
things more complicated.

> Further, I thought of proposing a GC concept in which only one
> namespace can be GC'ed at a time overall, to keep the load caused by
> GC'ing under control. Any opinions on that?

That could work fine, although I was thinking about having independent
namespaces that can GC concurrently. You have the freedom to make
whatever design decisions you think would make things simpler.
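
The two options discussed here (one namespace at a time versus
independent, concurrent GC) can be seen as the same mechanism with a
different limit. The following Python sketch is only an illustration
under that assumption; GcGate, max_concurrent and the collect callback
are hypothetical names.

```python
import threading


class GcGate:
    """Limits how many namespaces may run garbage collection at once."""

    def __init__(self, max_concurrent=1):
        # max_concurrent=1 serializes GC across all namespaces;
        # a larger value allows that many namespaces to GC concurrently.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, namespace, collect):
        # Blocks until a slot is free, then runs the collector for this
        # namespace and releases the slot afterwards.
        with self._slots:
            return collect(namespace)


gate = GcGate(max_concurrent=1)
result = gate.run("default", lambda ns: "collected " + ns)
```

Raising max_concurrent later would move from the serialized variant to
the concurrent one without changing callers.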



Regards.