Thanks for the various suggestions.
Based on feedback that the RDDs' memory usage may be causing this, I ran the following five experiments on the same dataset. A screenshot of the BlockManager UI is attached.
Earlier, the map stage ran on different workers because of an rdd.union I was trying. After eliminating the union, the map now runs on the same worker across iterations.
There is plenty of free memory on the workers. I tried different machines and up to five workers, with the same result.
If my RDD's values are under 100 characters each, I get sub-second latency. But my use case has values of about 4,000 characters.
--- Observations ---
For the same simplified implementation, with 2 workers and 130K elements in the RDD:
Exp1, in-memory cache: RDD<V>, each V is 4,000 chars; using rdd.cache()
Avg job time (over ten iterations): 10.4 sec; RDD memory size in BlockManager UI: 1098 MB; worker memory usage in top: ~30%
Exp2, serialized: each V is 4,000 chars; using rdd.persist(StorageLevel.MEMORY_ONLY_SER()) with a KryoRegistrator
Avg job time (over ten iterations): 10.8 sec; RDD memory size: 542 MB
Exp3, no RDD cache: each V is 4,000 chars; rdd.cache() not called
Avg job time (over ten iterations): 10.5 sec; RDD memory size in BlockManager UI: 0 MB
Exp4, replication: each V is 4,000 chars; using rdd.persist(StorageLevel.MEMORY_ONLY_2())
Avg job time (over ten iterations): 10.2 sec
Exp5: same as Exp1 except each V is 18 chars
Avg job time (over ten iterations): 575 msec
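For what it's worth, the RDD sizes reported above line up with a back-of-the-envelope estimate: deserialized Java Strings store text as UTF-16 (2 bytes per char), while Kryo-serialized ASCII text is close to 1 byte per char. A rough sketch (the 56-byte per-String overhead is my assumption, not a measured value):

```java
public class RddSizeEstimate {
    public static void main(String[] args) {
        long records = 130_000L; // elements in the RDD (from the experiments)
        long chars   = 4_000L;   // characters per value

        // Deserialized: 2 bytes/char (UTF-16) plus a small fixed per-String
        // overhead (object header, hash field, char[] header) -- assumed ~56 bytes.
        long inMemory = records * (chars * 2 + 56);

        // Kryo-serialized ASCII text: roughly 1 byte/char.
        long serialized = records * chars;

        System.out.printf("deserialized ~%d MB, serialized ~%d MB%n",
                inMemory / 1_000_000, serialized / 1_000_000);
        // prints: deserialized ~1047 MB, serialized ~520 MB
    }
}
```

That gives ~1047 MB deserialized and ~520 MB serialized, in the same ballpark as the 1098 MB (Exp1) and 542 MB (Exp2) the BlockManager UI reported.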
I'd appreciate your feedback,
Sijo