Reading multiple files into single RDD

Laeeq Ahmed

Mar 12, 2013, 4:50:50 AM
to spark...@googlegroups.com
Hi all,

What is an efficient way to read multiple files from HDFS into an RDD? I have 100 files, each about 560 MB. Right now I loop over the files, create an RDD for each one, and then perform different actions.

for (int i = 0; i <= 99; i++) {
    // Basic RDD takes one file at a time
    JavaPairRDD<LongWritable, Text> predictionFile = jsc.newAPIHadoopFile(
            "hdfs://192.168.1.130:54310/home/admin/predictionfiles/6_p0." + i + ".sdf",
            SDFInputFormat.class, LongWritable.class, Text.class, conf);
}
With this I think I don't get many mappers per file (560 MB / 128 MB block size ≈ 5 mappers). One way would be to merge these files into bigger files or to decrease the HDFS block size. I was just wondering whether there is some way to read all these files at once and then apply operations like map, filter, etc.

Regards,

Laeeq

Reynold Xin

Mar 12, 2013, 5:24:37 AM
to spark...@googlegroups.com
You can do

JavaPairRDD<LongWritable, Text> predictionFile = jsc.newAPIHadoopFile("hdfs://192.168.1.130:54310/home/admin/predictionfiles/6_p0.*", SDFInputFormat.class, LongWritable.class, Text.class, conf);
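
The glob-matched path gives a single RDD covering all 100 files, so transformations run over all of them at once. As a minimal, purely illustrative sketch (the map/filter below are hypothetical examples, assuming the Spark Java Function interface and scala.Tuple2 are imported):

JavaRDD<String> lines = predictionFile.map(
        new Function<Tuple2<LongWritable, Text>, String>() {
            public String call(Tuple2<LongWritable, Text> record) {
                // keep only the text value of each record
                return record._2().toString();
            }
        });
long nonEmpty = lines.filter(
        new Function<String, Boolean>() {
            public Boolean call(String line) {
                // hypothetical predicate: drop empty lines
                return !line.isEmpty();
            }
        }).count();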

--
Reynold Xin, AMPLab, UC Berkeley



Laeeq Ahmed

Mar 12, 2013, 6:13:53 AM
to spark...@googlegroups.com
Thanks Xin for your support.

Archit Thakur

Dec 16, 2013, 5:06:59 AM
to spark...@googlegroups.com
Hi,

What if the files belong to different directories on the HDFS file system?

Thanks and Regards,
Archit Thakur.

Reynold Xin

Dec 16, 2013, 1:51:11 PM
to spark...@googlegroups.com
You can use RDD.union to union multiple RDDs together. 
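
In the Java API that could look roughly like the sketch below (the second directory path is hypothetical; SDFInputFormat and conf are the ones from the original question):

JavaPairRDD<LongWritable, Text> dirA = jsc.newAPIHadoopFile(
        "hdfs://192.168.1.130:54310/home/admin/predictionfiles/6_p0.*",
        SDFInputFormat.class, LongWritable.class, Text.class, conf);
// hypothetical second directory on the same HDFS namenode
JavaPairRDD<LongWritable, Text> dirB = jsc.newAPIHadoopFile(
        "hdfs://192.168.1.130:54310/home/admin/otherpredictionfiles/6_p0.*",
        SDFInputFormat.class, LongWritable.class, Text.class, conf);
// one RDD spanning the files from both directories
JavaPairRDD<LongWritable, Text> combined = dirA.union(dirB);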