Data produced as output using Dumbo is not getting balanced.


Harshvardhan Solanki

Jun 27, 2015, 2:09:51 AM
to dumbo...@googlegroups.com
All the output files produced by a Dumbo command end up on the node from which I run the command. For example, suppose there is a node named hvs, and I ran the script:

dumbo start matrix2seqfile.py -input hdfs://hm1/user/trainf1.csv -output hdfs://hm1/user/train_hdfs5.mseq -numreducetasks 25 -hadoop $HADOOP_INSTALL 

I ran the above script from the node hvs.

When I inspected the file system, I found that all the output files had accumulated on node hvs.
Ideally, the files would be distributed throughout the cluster, but the data is not getting balanced across it.

The snapshot I have attached shows 25 parts, all of which belong to node hvs.

How can I fix this?


Screenshot from 2015-06-15 16:18:58.png