The way to think about the clustering design


Sea Kaban

unread,
Nov 3, 2010, 1:56:41 AM11/3/10
to dremel
This post is for folks who want to join the effort of creating a sound clustering design.
It is hard to build such a complicated system on the first attempt, so I suggest
we start with another, well-defined problem that will give us vital
understanding of the Dremel clustering.
We will take the famous distributed word count problem - the classical
example in most explanations of MapReduce. (You can read about
it here http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html)
So our task is this: implement distributed word count on a
cluster of computers. The main requirement (aside from the counting
itself) is low latency - no more than 1 second.
Please reply to this message with a brief explanation of how you would
implement this task. Feel absolutely free in your choice of tools and libraries.
Regards,
David
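For readers unfamiliar with the word count example from the MapReduce tutorial linked above, a minimal single-machine version (in Python, purely for illustration) might look like this:

```python
from collections import Counter

def word_count(text):
    # Split on whitespace and tally occurrences of each word.
    return Counter(text.split())

counts = word_count("the quick brown fox jumps over the lazy dog the")
print(counts["the"])  # 3
```

The distributed version discussed in this thread spreads exactly this counting across many machines and merges the partial results.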

Evgeny B

unread,
Nov 3, 2010, 10:07:15 AM11/3/10
to dremel
Here is my first idea:
The text file(s) are divided into equal chunks and pre-allocated on the
worker machines.
The master machine sends a signal (query) to every worker. The workers
read their files and count the occurrences of words in a map (hash table?).
When a worker is done, it pushes its map entries to the master. The
master merges the maps as it receives them (the reduce phase).
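A minimal sketch of this design in Python, using threads to stand in for the worker machines (the chunk contents and the thread-pool fan-out are illustrative assumptions, not part of the proposal):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pre-allocated chunks; in the proposed design each chunk
# would live on a separate worker machine.
CHUNKS = [
    "to be or not to be",
    "that is the question",
    "to be is to do",
]

def worker(chunk):
    # Map phase: each worker counts word occurrences in its own chunk.
    return Counter(chunk.split())

def query():
    # The master fans the query out to every worker and merges the
    # partial maps as they arrive (the reduce phase).
    total = Counter()
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(worker, CHUNKS):
            total.update(partial)
    return total

print(query()["to"])
```

Because each partial map is merged as soon as it arrives, the master never waits for all workers before starting the reduce work, which helps with the 1-second latency goal.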