Hi Imran,
Just saw your pull request on Accumulable. I like the idea of extending the interface to minimize object churn. I'll get back to you with some comments on the code later, but I think it's the best solution for this use case.
> If I understand correctly, reduceByKey will work well when there is one type of thing you're accumulating. However, if you want to group things in different ways, then you need to do multiple passes over the data, right? E.g., imagine that you're processing documents from wikipedia, and you want to simultaneously get:
> * distribution of words per document
> * sets of a special type of extracted entity (e.g., names of locations that score high on some custom function)
> * total count of citations
Yup, this is kind of annoying right now. In the longer term I want to add some way to batch multiple actions and automatically launch a job with all of them. Something like:
rdd.reduceLater(reduceFunction1) // returns Future[ResultType1]
rdd.reduceLater(reduceFunction2) // returns Future[ResultType2]
SparkContext.force() // executes all the "later" operations as part of a single optimized job
However, in the shorter term, I think accumulators are the easiest way to do this, especially with the fixes you have to reduce GC.
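To make that concrete, here's a rough sketch of the wikipedia case (simplifying your list a bit), doing the word counts with reduceByKey and the citation total with a plain Int accumulator in the same job. docs and countCitations are placeholders I'm making up here, and I haven't run this:

val docs = sc.textFile("...") // pretend each line is one wikipedia document
val citations = sc.accumulator(0)
val wordCounts = docs.flatMap { doc =>
  citations += countCitations(doc) // placeholder helper; runs on the workers during the same pass
  doc.split("\\s+").map(word => (word, 1))
}.reduceByKey(_ + _)
wordCounts.count() // any action kicks off the job; the accumulator gets filled in as a side effect
println(citations.value) // read the total back on the driver afterwards

One caveat: because the accumulator update happens inside a transformation, it can be applied more than once if a task gets re-run, so the total is only exact when the job runs cleanly. The entity sets would follow the same pattern, and that's exactly where something like your Accumulable change helps, since you can merge into one mutable collection per partition instead of allocating a new object on every add.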
Yes, the applications we've had that did this used mapPartitions. It would look roughly like this:
val points = ... // RDD of Point objects
val gradient = points.mapPartitions { partitionPoints =>
  val localGradient = new Vector(…)
  for (point <- partitionPoints) {
    localGradient += ... // gradient math for this point
  }
  Iterator(localGradient) // return an iterator containing just the one local gradient
}.reduce(_ + _) // sum up the per-partition results
That is, you turn an RDD[Point] into an RDD[Vector] with one vector per partition (the local gradient), and then you call reduce on that to do the final sum.
Unfortunately I don't have any real sample code for this right now, but if you try it out, it would be cool to add it to the examples folder.
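If it helps as a starting point, though, here's a very rough (untested) sketch of that pattern with everything stubbed in. Point, the tiny inline dataset, and the update formula are all stand-ins, not anything from the codebase:

case class Point(features: Array[Double], label: Double)

val dims = 2 // stand-in dimensionality
val points = sc.parallelize(Seq(
  Point(Array(1.0, 0.0), 1.0),
  Point(Array(0.0, 1.0), -1.0)
)) // tiny stand-in dataset just to make this self-contained

val gradient = points.mapPartitions { iter =>
  val local = new Array[Double](dims) // one local gradient per partition
  for (point <- iter) {
    for (i <- 0 until dims) {
      local(i) += point.label * point.features(i) // stand-in update; substitute the real gradient math
    }
  }
  Iterator(local) // emit a single element for this partition
}.reduce { (a, b) =>
  a.zip(b).map { case (x, y) => x + y } // element-wise sum of the per-partition gradients
}

This mutates one local array per partition instead of allocating a new vector per point, and only ships one array per partition back for the final reduce.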
Matei