RE The way to think about the clustering design

Evgeny B

unread,

Nov 3, 2010, 10:13:40 AM11/3/10

to dremel

Here is my first idea:
The text file/s are divided into equal chunks and per-allocated on
worker machines.
The master machine sends a signal (query) to every worker. The workers
read files and count appearances of words in a map (hash table?).
When a worker done, it pushes the map entries to the master. The
master merges the maps as it receives them (the reduce phase).

Sea Kaban

unread,

Nov 4, 2010, 2:16:45 AM11/4/10

to dremel

Lets start from this design. We will implement it, run on cluster of
Amazon EC2 instances, measure performance and learn all possible
lessons.

In the same time I would like to relate to the above design in number
of perspectives:
Scalability:
It is viable Idea for the small-to-medium set of reliable computers.
We assume that one central computer is strong
enough to merge results from other computers (lets call them leafs) in
the resonable time.
In fact - merge process also consumes CPU, and in some scale central
computer will also became a bottleneck. So we need to address it in
our design.

Reliability: In the big-enough scale some failures will happens
frequently enough to prevent system from completing many queries
successfully. So
we should improve our design in the way, that system will survive
failure of any particular node. We need to achive a kind of inta-query
failover. It mean that we
do not want to fail the whole query because of one node failure.
Instead we want to recompute minimum required information and get the
result.

Multi-tenancy - the system will be used by the many users, and we
should schedule the work in some logical manner.

David

Vladik

unread,

Nov 4, 2010, 4:12:57 AM11/4/10

to dremel

Can we divide merge stage ?
I am thinking about having leafs multilevel hierarchy with master leaf
doing reduce for the pair and then transferring it's results to it's
master and so on.
We can have master leaf redundancy where each slave will have 2 or
more master leafs.
At the end we will have single master merger that will merge all sub
merges. My assumption is that it's less CPU intensive, also I think
this can be potentially more fault tolerant

Sea Kaban

unread,

Nov 5, 2010, 9:57:01 AM11/5/10

to dremel

To have hierarchy of the merging servers - is very scalable approach.
Actually it what
we have to do, since our bible saying that we have to have tree based
execution. The trickier question - is how to know what
part of work should be redone in case of the fauilure of one of the
intermidieate servers?
Another question - do we have to build this hierarchy ad-hoc for each
query, or it should be static configuration?

Oleg Gibayev

unread,

Nov 6, 2010, 7:39:54 AM11/6/10

to opend...@googlegroups.com

I think that idea about to divide merge stage is right way, so our nodes will have piggyback functionality. This approach really diminishes master's load but as Sea Kaban rightly have pointed out,what will be with intermediate data in case if server has fault ? I think here we can use GFS's opportunity - replicating data regions. Since we have decided to use tree hierarchy i suggest to organise our tree as self-balanced tree, for example AVL tree, where each node will have some complex weight (including server's load factor, etc). In case if server has some problems, it will be removed from the tree and tree will be rebalanced and during the rebalancing procedure all nodes which reside below (in tree hierarchy) are notified about structure's changes. When server is repaired it will be injected into the tree structure with max. weight allowing it to discharge its neighbors.
Let discuss about it :)

BR
Annoying orange

Evgeny B

unread,

Nov 8, 2010, 8:24:50 AM11/8/10

to dremel

Since the text length tends to grow faster than the vocabulary, I
would guess that we can start from the naive suggestion that the merge
time is relatively small even if all leafs complete simultaneously (so
the tree height of 2 is sufficient), can't we?
If, not, there are many options, e. g. first N servers that complete
their task will multicast an advertisement that they are ready to
become mergers. It will be easier to discuss the tree configuration
having some benchmarks.

Reliability: every leaf can report progress to its neighbor leaf
(variants: 2 neighbors, 3, 4, 5 in a logical ring). The neighbor(s)
will hold a copy of the text of the reporting leaf (1/2 of text, 1/3,
1/4, 1/5). If the leaf fails the neighbor(s) will pick up its task.

Reply all

Reply to author

Forward