The way to think about the clustering design


Sea Kaban

unread,
Nov 3, 2010, 1:56:41 AM11/3/10
to dremel
This post is for folks who want to join the effort of creating a sound clustering design.
It is hard to build such a complicated system on the first attempt, so I suggest
we start with another, well-defined problem that will give us vital
understanding of the Dremel clustering.
We will take the famous distributed word count problem - the classical
example in most explanations of MapReduce. (You can read about
it here http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html)
So our task is this: implement distributed word count on a
cluster of computers. The main requirement (aside from the counting
itself) is low latency - no more than 1 second.
Please reply to this message with a brief explanation of how you would
implement this task. Feel absolutely free in your choice of tools and libraries.
Regards,
David
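For readers unfamiliar with the word count example from the MapReduce tutorial linked above, a minimal single-machine version (in Python, purely for illustration) might look like this:

```python
from collections import Counter

def word_count(text):
    # Split on whitespace and tally occurrences of each word.
    return Counter(text.split())

counts = word_count("the quick brown fox jumps over the lazy dog the")
print(counts["the"])  # 3
```

The distributed version discussed in this thread spreads exactly this counting across many machines and merges the partial results.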

Evgeny B

unread,
Nov 3, 2010, 10:07:15 AM11/3/10
to dremel
Here is my first idea:
The text file(s) are divided into equal chunks and pre-allocated on the
worker machines.
The master machine sends a signal (query) to every worker. The workers
read their files and count the occurrences of words in a map (hash table?).
When a worker is done, it pushes its map entries to the master. The
master merges the maps as it receives them (the reduce phase).
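A minimal sketch of this design in Python, using threads to stand in for the worker machines (the chunk contents and the thread-pool fan-out are illustrative assumptions, not part of the proposal):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pre-allocated chunks; in the proposed design each chunk
# would live on a separate worker machine.
CHUNKS = [
    "to be or not to be",
    "that is the question",
    "to be is to do",
]

def worker(chunk):
    # Map phase: each worker counts word occurrences in its own chunk.
    return Counter(chunk.split())

def query():
    # The master fans the query out to every worker and merges the
    # partial maps as they arrive (the reduce phase).
    total = Counter()
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(worker, CHUNKS):
            total.update(partial)
    return total

print(query()["to"])
```

Because each partial map is merged as soon as it arrives, the master never waits for all workers before starting the reduce work, which helps with the 1-second latency goal.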