reduce() reduces all of the items in an RDD into a single value; it does not return an RDD, which is why you're seeing a type error.
If you want the total number of 1s in the whole dataset, I would extract the values and then sum them. The map and the reduce will be pipelined, so this should be efficient:
// Placeholder: substitute your existing pair RDD here.
JavaPairRDD<String, Integer> Molecules = null;
// Extract the Integer value from each (key, value) pair.
JavaRDD<Integer> values = Molecules.map(new Function<Tuple2<String, Integer>, Integer>() {
    @Override
    public Integer call(Tuple2<String, Integer> obj) throws Exception {
        return obj._2();
    }
});
// Sum the values; reduce() returns a single Integer, not an RDD.
Integer total = values.reduce(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer a, Integer b) {
        return a + b;
    }
});
A recent pull request added keys() and values() methods (and I'll add aliases for these in the Java API), so in the next release you should be able to write this as:
JavaPairRDD<String, Integer> Molecules = null;
Integer total = Molecules.values().reduce(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer a, Integer b) {
        return a + b;
    }
});
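If you want to convince yourself of the types without a cluster, here is a rough plain-Java analogue of the same extract-then-sum pattern. It uses java.util.stream rather than Spark's API, and the class name ValueSumSketch, the helper sumValues, and the sample keys are made up for illustration. It is only a sketch of the shape of the computation: map produces another collection-like thing, while reduce collapses it to one value.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class ValueSumSketch {
    // Hypothetical helper: sums the Integer half of each (key, value) pair.
    static int sumValues(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream()
                .map(Map.Entry::getValue)  // like extracting obj._2() from each pair
                .reduce(0, Integer::sum);  // collapses to a single Integer, not a collection
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the pair RDD.
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
                new SimpleEntry<String, Integer>("mol-a", 1),
                new SimpleEntry<String, Integer>("mol-b", 0),
                new SimpleEntry<String, Integer>("mol-c", 1));
        System.out.println(sumValues(pairs)); // prints 2
    }
}
```

One difference to keep in mind: the stream version passes an identity of 0, so an empty list sums to 0, whereas the reduce() calls above take no identity value.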
- Josh