Hi,
I am thinking of writing a new RDD to process log files that live on the local file systems of multiple machines.
My plan is to run Spark on all of these nodes, and I want an RDD that processes the log files on each node. The input would be a list of machine names, I want the partitioning to be based on machine name / IP address, and I want to make sure each partition runs on exactly that machine (i.e. partition m1 is scheduled to run only on m1 and nowhere else, with NODE_LOCAL strictly enforced).
My questions are: can the existing HadoopRDD be used for this, and if so, what flag or parameter should I pass to force NODE_LOCAL? Or do I have to write a new RDD for this (one that takes a list of machine names as input and processes the files on each machine in parallel)?
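To sketch what I have in mind if a custom RDD is the way to go (class, field, and path names below are just placeholders, and I understand getPreferredLocations is only a locality preference for the scheduler, not a hard pin, so presumably spark.locality.wait would need to be raised to keep tasks waiting for a NODE_LOCAL slot):

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Placeholder partition type carrying the host it should run on.
case class HostPartition(index: Int, host: String) extends Partition

// Sketch: one partition per machine, each preferring its own host.
// NOTE: preferred locations are a scheduling hint, not a guarantee.
class LocalLogRDD(sc: SparkContext, hosts: Seq[String], logPath: String)
    extends RDD[String](sc, Nil) {

  override def getPartitions: Array[Partition] =
    hosts.zipWithIndex.map { case (h, i) => HostPartition(i, h) }.toArray

  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[HostPartition].host)

  // Runs on the executor, so it reads the executor's local file system.
  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    scala.io.Source.fromFile(logPath).getLines()
}
```

I also noticed SparkContext.makeRDD has an overload that takes Seq[(T, Seq[String])] with location preferences per element; would that be a simpler route than subclassing RDD?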
Thanks,
Sathish