// Note: I am using "break" to exit the main loop in the reduce function (iterating over "edgeList")
Example A:

var rdd = sc.parallelize(List(1, 2, 3, 4))
var i = 0
while (i < 3) {
  val result = rdd.flatMap { x => Thread.sleep(1000); List(x) }
  rdd = result.map(x => x * 10)
  println("Count: " + rdd.count)
  i += 1
}
Example B:

var rdd = sc.parallelize(List(1, 2, 3, 4))
var i = 0
while (i < 3) {
  val result = rdd.flatMap { x => Thread.sleep(1000); List(x) }
  rdd = result.map(x => x * 10).cache()
  println("Count: " + rdd.count)
  i += 1
}
How many times is "Count: 4" printed in Example A and in Example B?
About how long does the final iteration of the while loop take in Example A and in Example B?
Answers: "Count: 4" is printed 3 times in both examples (the element count never changes). The final iteration takes about 12 seconds in Example A, because without caching the whole lineage is recomputed: three flatMap stages, each sleeping 1 second per element over 4 elements. In Example B it takes about 4 seconds, because cache() keeps the previous iteration's result in memory, so only the newest flatMap stage actually runs.
| Graph Nodes | Execution Time  | Number of Iterations | Data Size |
|-------------|-----------------|----------------------|-----------|
| 17 nodes    | 4 secs          | 2                    | 4.9 KB    |
| 50 nodes    | 5 secs          | 3                    | 40.5 KB   |
| 100 nodes   | 6 secs          | 2                    | 179 KB    |
| 500 nodes   | 10 secs         | 3                    | 4.35 MB   |
| 1000 nodes  | 28 secs         | 3                    | 17.4 MB   |
| 5000 nodes  | 7 mins, 49 secs | 3                    | 435 MB    |
| 10000 nodes | 1 hr, 14 mins   | 3                    | 1.7 GB    |
I have used KryoSerializer and re-executed the code, but it is still not improving the performance. As pointed out in the above mails, I feel that it is the shuffle time that dominates. Perhaps something can be changed in the framework to reduce the shuffle time.
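For reference, this is roughly how I enabled Kryo. A minimal sketch, assuming Spark's standard SparkConf API; the registered class names here are hypothetical placeholders for the types the attached code actually shuffles:

```scala
import org.apache.spark.SparkConf

// Sketch: switch the serializer to Kryo and register the shuffled classes.
// Registering classes up front avoids writing the full class name with
// every serialized record, which reduces shuffle size.
val conf = new SparkConf()
  .setAppName("ShortestPath") // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[Array[Double]],                               // placeholder
    classOf[scala.collection.mutable.HashMap[Int, Double]] // placeholder
  ))
```

If the classes are not registered, Kryo still works but with larger serialized output, so the gain over Java serialization can be small.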
I have attached the code I am working on. Someone can run it on their cluster and check the performance. Please verify whether the code can be optimized further, or whether something in the Spark framework is causing it to take this long.
This is a small dataset; I am unable to upload bigger datasets due to size limits. You may use some large graph data if you have it.
The input format is:
nodeId Edge1:Weight, Edge2:Weight, ...
A sample graph will be like this:
1 2:4.0,3:12.0,4:5.0
2 1:4.0,3:7.0,4:8.0
3 1:12.0,2:7.0,4:2.0
4 1:5.0,2:8.0,3:2.0
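To make the format concrete, here is a minimal sketch of a line parser for it. `GraphParser` is a hypothetical helper name (not from the attached code); it turns each line into a node id plus an adjacency map:

```scala
// Hypothetical parser for the input format above:
// "nodeId edge1:weight1,edge2:weight2,..."
object GraphParser {
  def parseLine(line: String): (Int, Map[Int, Double]) = {
    // Split the node id from the comma-separated edge list.
    val Array(node, edges) = line.trim.split("\\s+", 2)
    // Each edge entry is "neighbourId:weight".
    val adj = edges.split(",").map { e =>
      val Array(dst, w) = e.split(":")
      dst.toInt -> w.toDouble
    }.toMap
    (node.toInt, adj)
  }
}
```

For example, `GraphParser.parseLine("1 2:4.0,3:12.0,4:5.0")` yields `(1, Map(2 -> 4.0, 3 -> 12.0, 4 -> 5.0))`; in Spark this would typically be applied with `sc.textFile(path).map(GraphParser.parseLine)`.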
Thanks,
Gaurav