Data Locality - How to force a task to run only on the node where the data is available (strict NODE_LOCAL)


Sathish Senathi

Aug 6, 2013, 5:55:21 AM
to spark-de...@googlegroups.com
Hi,

I am thinking of creating a new RDD to process log files located in local file systems across multiple machines.

My plan is to run Spark on all these nodes, and I want an RDD that processes the log files on each node. My input will be a list of machine names, and I want the partitions to be based on machine name / IP address, with each partition guaranteed to run on its exact machine (i.e. the partition for m1 is scheduled only on m1, not anywhere else; NODE_LOCAL is strictly followed).

My question is: can the existing HadoopRDD be used for this? What flag or parameter should I pass to force NODE_LOCAL? Or do I have to write a new RDD for this (i.e. one that takes a list of machine names as input and processes the files on each machine in parallel)?

Thanks
Sathish

Reynold Xin

Aug 6, 2013, 4:46:41 PM
to spark-de...@googlegroups.com
It might be hard to use HadoopRDD because you don't have a distributed file system. The driver wouldn't be able to read the file system metadata for those log files. 

You can however create a new RDD class to do this. Just look at how HadoopRDD is written. 
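As a rough sketch of the locality hook such a custom RDD would override (Spark's `Partition` trait is stubbed out here so the snippet compiles standalone; `MachineLogRDD` and `MachinePartition` are illustrative names, not Spark API):

```scala
// Self-contained sketch: Spark's Partition trait is stubbed out so this
// compiles standalone. In real code you would extend Spark's RDD class and
// override getPartitions, getPreferredLocations, and compute the same way.
trait Partition { def index: Int }

// One partition per machine; each partition remembers its target host.
case class MachinePartition(index: Int, host: String) extends Partition

class MachineLogRDD(hosts: Seq[String]) {
  // Mirrors RDD.getPartitions: one partition per machine name.
  def getPartitions: Array[Partition] =
    hosts.zipWithIndex.map { case (h, i) => MachinePartition(i, h): Partition }.toArray

  // Mirrors RDD.getPreferredLocations: the scheduler consults this when
  // deciding where each task may run, so reporting exactly one host per
  // partition steers the task toward NODE_LOCAL placement on that host.
  def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[MachinePartition].host)
}
```

In the real RDD, `compute` would then open the local log file on whichever node the task lands.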

To enforce locality, change the spark.locality.wait property. By default it only waits 3 seconds (3000 ms) before giving up on a local slot; you can set it much higher, effectively forcing 100% locality. 
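For example (in the Spark releases of that era this was a Java system property set before creating the SparkContext; the one-hour value is just an illustration):

```scala
// Raise the locality wait so the scheduler keeps waiting for a NODE_LOCAL
// slot instead of falling back to a non-local one. Value is in milliseconds.
System.setProperty("spark.locality.wait", "3600000")  // 1 hour
```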

--
Reynold Xin, AMPLab, UC Berkeley





Evan Chan

Aug 7, 2013, 9:05:05 PM
to spark-de...@googlegroups.com
You can also use HadoopRDD with your own InputFormat, which assigns locality specifically to the nodes holding the data you want.

We have a custom InputFormat and HadoopRDD does not need HDFS at all.

Granted, writing an RDD is probably easier. Also, I thought there was an API for assigning locality.
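A rough sketch of the InputFormat approach (Hadoop's interface is stubbed out here so the snippet compiles standalone; `HostPinnedSplit` is an illustrative name — the key idea is that each split's `getLocations` reports the one host it should run on):

```scala
// Self-contained sketch: Hadoop's InputSplit interface is stubbed out so
// this compiles standalone. A real implementation would extend
// org.apache.hadoop.mapred.InputFormat / InputSplit instead.
trait InputSplit { def getLocations: Array[String] }

// A split pinned to one host: getLocations is what Spark reads as the
// split's preferred nodes when scheduling the corresponding task.
case class HostPinnedSplit(host: String, path: String) extends InputSplit {
  def getLocations: Array[String] = Array(host)
}

// Mirrors InputFormat.getSplits: one split per (host, local file) pair.
def getSplits(hostFiles: Seq[(String, String)]): Array[InputSplit] =
  hostFiles.map { case (h, p) => HostPinnedSplit(h, p): InputSplit }.toArray
```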