I have re-write the code like this :
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ListBuffer
object test {
def main(args: Array[String]) {
val sc = new SparkContext("local", "SparkTest")
// use val, not var - this value doesn't need to change once initialised
// similarly, ListBuffer is a mutable structure - and pointless given that we
// parallelize it to an RDD anyway
val input = Seq( //let scalac infer the type
1 -> 5.0,
2 -> 2.0,
3 -> 3.0,
4 -> 6.0 // a->b is shorthand for the tuple (a,b)
)
val inputRdd = sc.parallelize(input) //val again
// we map straight to the output, so no need to pre-initialise anything
// or use intermediate collections
//val temp:RDD[List[(Int,Double)]] = RDD[List[(Int,Double)]]()
val temp = List[(Int,Double)]()
var clampedList = sc.parallelize(temp)
clampedList ++= clamp(inputRdd)
clampedList ++= clamp(clampedList)
// use collect() to get the data (as an Array) back out of RDD form
println("Output \n--")
//println(clamped1.collect().mkString)
//println(clamped2.collect().mkString)
clampedList.foreach(println)
System.exit(0)
}
def printIt(xs: RDD[(Int,Double)]) = xs.collect().mkString
// tupled is a very useful method, converting a function that takes 2 args into a function
// that takes a tuple
def clamp(xs: RDD[(Int,Double)]): RDD[(Int,Double)] = xs.map({case (x,y) => mfunc(x,y)}).reduceByKey(_+_)
//def clamp(xs: RDD[(Int,Double)]): RDD[(Int,Double)] = xs.map(mfunc.tupled).reduceByKey(_+_)
// always strive for the simplest code, and AVOID VARS!
// also, it's faster to stick with integer arithmetic
def mfunc(a: Int,b: Double):(Int, Double) = ((a/2) + (a%2)) -> b
}
If I understand correctly the clampedList in this case is : [(1,7.0),(2,9.0),(1,16.0)]. ( Btw,t how can I access only the first or only the second pair in the RDD ?)
Instead, I would like to be able to have a list or vector or array where when I type the first element of this array to give me (1,7.0),(2,9.0)
and the second (or last in this case) element the value (1,16.0).
Thank you.