Hi,
I was
running a simple map/reduce job (stratosphere-0.5-rc2) on an IBM cluster (9
slaves) over a 7GB Twitter dataset, extracting only the unique vertex IDs <src> from an edges file <src,tar>.
The job has been executing for more than 3 hours and still has not
completed.
What would be the approximate expected running time on a 10-node
cluster for around 7GB of data? Could you kindly help me find the problem in my case? Here are my simple user-defined map and reduce functions:
Map:
// Parse one line "<src><delim><tar>" into a (src, tar) tuple.
public Tuple2<Long, Long> map(String value) throws Exception {
    String[] array = value.split(this.delim);
    Tuple2<Long, Long> emit = new Tuple2<Long, Long>();
    emit.f0 = Long.parseLong(array[0]);
    emit.f1 = Long.parseLong(array[1]);
    return emit;
}
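A variant I was considering, in case per-record allocation is part of the problem (just a guess on my side), is reusing a single Tuple2 instance across map() calls instead of creating a new tuple per line. I am assuming the returned tuple is serialized into the groupBy before the next call mutates it:

// Sketch only: one reusable tuple per mapper instance instead of a
// fresh allocation for every input line (assumes the returned tuple
// is consumed/serialized before the next map() call reuses it).
private final Tuple2<Long, Long> reuse = new Tuple2<Long, Long>();

public Tuple2<Long, Long> map(String value) throws Exception {
    String[] array = value.split(this.delim);
    reuse.f0 = Long.parseLong(array[0]);
    reuse.f1 = Long.parseLong(array[1]);
    return reuse;
}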
Reduce:
// Each group delivered here shares the same f0 (the source vertex ID),
// so emitting f0 of the first element gives one unique ID per group.
public void reduce(Iterator<Tuple2<Long, Long>> values,
        Collector<Long> out) throws Exception {
    Long srcKey = values.next().f0;
    out.collect(srcKey);
}
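I was also wondering whether the lack of a combiner could explain the runtime, since without one all parsed tuples from the 7GB input get shuffled and sorted at the reducers. If the 0.5 Java API supports combinable group reduces the way I think it does (a @Combinable annotation on GroupReduceFunction plus an overridable combine() that maps the input type to itself; please correct me if not), a sketch of my Reducer with local pre-aggregation would be:

// Sketch, assuming GroupReduceFunction.Combinable works in 0.5 as I recall:
// pre-aggregate locally so each mapper forwards one tuple per src key.
@GroupReduceFunction.Combinable
public static class Reducer extends GroupReduceFunction<Tuple2<Long, Long>, Long> {

    @Override
    public void reduce(Iterator<Tuple2<Long, Long>> values,
            Collector<Long> out) throws Exception {
        out.collect(values.next().f0);
    }

    @Override
    public void combine(Iterator<Tuple2<Long, Long>> values,
            Collector<Tuple2<Long, Long>> out) throws Exception {
        // Forward only the first tuple of each local group.
        out.collect(values.next());
    }
}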
Java main method:
// Read the edges file, parse each line, group by the source ID (field 0),
// and emit one ID per group.
DataSet<String> text = env.readTextFile(inputfilepath);
DataSet<Long> result = text
        .map(new TextMapper(fieldDelimiter))
        .groupBy(0)
        .reduceGroup(new Reducer());
result.writeAsText(outputfilepath, WriteMode.OVERWRITE);
env.execute();
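And for completeness, here is the whole job as a minimal self-contained sketch. The imports assume the eu.stratosphere.api.java.* package layout of 0.5 (please correct me if the paths differ); the class names TextMapper, Reducer, and UniqueSourceIds are just mine:

import java.util.Iterator;

import eu.stratosphere.api.java.DataSet;
import eu.stratosphere.api.java.ExecutionEnvironment;
import eu.stratosphere.api.java.functions.GroupReduceFunction;
import eu.stratosphere.api.java.functions.MapFunction;
import eu.stratosphere.api.java.tuple.Tuple2;
import eu.stratosphere.core.fs.FileSystem.WriteMode;
import eu.stratosphere.util.Collector;

public class UniqueSourceIds {

    public static void main(String[] args) throws Exception {
        String inputfilepath = args[0];
        String outputfilepath = args[1];
        String fieldDelimiter = args[2];

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read the edges file, parse each line, group by the source ID
        // (field 0), and emit one ID per group.
        DataSet<String> text = env.readTextFile(inputfilepath);
        DataSet<Long> result = text
                .map(new TextMapper(fieldDelimiter))
                .groupBy(0)
                .reduceGroup(new Reducer());
        result.writeAsText(outputfilepath, WriteMode.OVERWRITE);

        env.execute();
    }

    public static class TextMapper extends MapFunction<String, Tuple2<Long, Long>> {
        private final String delim;

        public TextMapper(String delim) {
            this.delim = delim;
        }

        @Override
        public Tuple2<Long, Long> map(String value) throws Exception {
            String[] array = value.split(this.delim);
            Tuple2<Long, Long> emit = new Tuple2<Long, Long>();
            emit.f0 = Long.parseLong(array[0]);
            emit.f1 = Long.parseLong(array[1]);
            return emit;
        }
    }

    public static class Reducer extends GroupReduceFunction<Tuple2<Long, Long>, Long> {
        @Override
        public void reduce(Iterator<Tuple2<Long, Long>> values,
                Collector<Long> out) throws Exception {
            // One unique source ID per group.
            out.collect(values.next().f0);
        }
    }
}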
Thanks,
Janani