Hello Georg, and welcome to the beam community!
> I assume that would require changes all over the DDFS code.
A lot of the changes take the form of replacing a single process
with a set of processes (one for each namespace). For example, it does
not make sense to have a single process registered as ddfs_gc. Instead,
there should be something like a ddfs_gc_dict that maps a namespace
to its process. Fortunately, Erlang is a great language for
such things!
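To make the idea concrete, here is a rough sketch of such a namespace-to-process mapping. It is written in Python rather than Erlang only to keep it short, and every name in it (NamespaceRegistry, GcWorker, gc_dict) is made up for illustration; nothing like this exists in DDFS today.

```python
import threading


class NamespaceRegistry:
    """Maps a namespace name to its own worker (hypothetical sketch)."""

    def __init__(self, factory):
        self._factory = factory   # builds a worker for a new namespace
        self._workers = {}        # namespace -> worker
        self._lock = threading.Lock()

    def lookup(self, namespace):
        """Return the worker for `namespace`, creating it on first use."""
        with self._lock:
            if namespace not in self._workers:
                self._workers[namespace] = self._factory(namespace)
            return self._workers[namespace]


class GcWorker:
    """Stand-in for a per-namespace garbage collector process."""

    def __init__(self, namespace):
        self.namespace = namespace


# One registry replaces the single globally registered process.
gc_dict = NamespaceRegistry(GcWorker)
```

Callers would then ask gc_dict.lookup("some_namespace") for the right worker instead of addressing one global name.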
> The namespaces should isolate the tags, so that one tag can only refer to other tags in the same namespace?
Correct. One of the main incentives behind this project is being able
to run multiple independent garbage collections -- one for each
namespace. This will require that each tag only points to the other
tags in the same namespace.
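As a small illustration of that rule, the check below flags links that leave a tag's namespace. Representing a tag as a (namespace, name) pair is purely an assumption for this sketch; it is not how DDFS names tags today.

```python
def cross_namespace_links(tag, links):
    """Return the links that point outside the tag's own namespace.

    `tag` and every entry in `links` are (namespace, name) pairs,
    a made-up representation used only for this sketch.
    """
    namespace, _name = tag
    return [link for link in links if link[0] != namespace]
```

An empty result means the tag is safe to collect together with the rest of its namespace.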
One of the good-to-have requirements of this project is backward
compatibility, meaning it should be possible to upgrade from older
versions of Disco to Disco-with-namespaces. This requirement can
probably be relaxed if it is too hard to achieve.
> The description states, that there is already a naive approach to network awareness, and I wonder where I could find the corresponding code.
Currently, Disco will push compute to the data whenever possible.
That is, if a node has a replica of a blob, the Disco scheduler will do
its best to schedule the task on that node.
Take a look at job_coordinator:do_submit_tasks_in and
fair_scheduler_job:assign_task.
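The preference itself boils down to something like the sketch below. This is a simplified Python illustration of "prefer a node that holds a replica", not a transcription of the Erlang functions above; pick_node and its arguments are hypothetical.

```python
def pick_node(replica_nodes, free_nodes):
    """Prefer a free node that holds a replica; otherwise any free node.

    replica_nodes: set of node names that store the input blob
    free_nodes:    list of node names with a free task slot
    Returns None when no node is free.
    """
    for node in free_nodes:
        if node in replica_nodes:
            return node
    return free_nodes[0] if free_nodes else None
```

Network awareness would extend the fallback branch: instead of taking any free node, pick the free node closest to a replica.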
> I thought probably measuring the hops via traceroute or the bandwidth and latency via a sample transmission (or more fancy, adjust while operation?) could work.
For this project, we are looking for a way for the administrator to
pre-set the rack names of the nodes. For example, when adding slave
nodes, another field would be added to specify the rack. Another approach
would be to accept a script that runs on each node and returns the
rack name of that node. Another good-to-have feature is a hierarchy of
racks.
Hadoop documents the same issue in this JIRA ticket [1].
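To give a feel for what a rack hierarchy could buy us, here is a rough Python sketch. It assumes the administrator supplies a rack path per node, written like "/dc1/rack1"; the NODE_RACKS table, the node names, and both functions are invented for illustration.

```python
# Hypothetical administrator-supplied mapping: node name -> rack path.
NODE_RACKS = {
    "node1": "/dc1/rack1",
    "node2": "/dc1/rack1",
    "node3": "/dc1/rack2",
    "node4": "/dc2/rack1",
}


def rack_distance(rack_a, rack_b):
    """Count the hops from each rack up to their closest common ancestor."""
    a = rack_a.strip("/").split("/")
    b = rack_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def node_distance(node_a, node_b):
    """Distance between two nodes based only on their configured racks."""
    return rack_distance(NODE_RACKS[node_a], NODE_RACKS[node_b])
```

With this table, two nodes in the same rack are at distance 0, nodes in sibling racks at 2, and nodes in different data centers at 4, so the scheduler could simply prefer the smallest distance to a replica.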
> I hope my questions are not misplaced here.
There is no better place to ask your questions!
[1]: https://issues.apache.org/jira/browse/HADOOP-692
Regards.