How to read multiple files from a HDFS directory in Spark Streaming...


goutam tadi

Apr 26, 2013, 9:28:49 AM4/26/13
to spark...@googlegroups.com
I am unable to read multiple files from a directory in Spark Streaming.

I have been using

val tweets = ssc.fileStream("hdfs://localhost:8020/user/hdfs/sparkinput/")
tweets.print()


When I try to run this, I see no output on the console. Help me in this regard.

Sijo

Apr 26, 2013, 10:35:15 AM4/26/13
to spark...@googlegroups.com

Try ""hdfs://localhost:8020/user/hdfs/sparkinput/*"
That works for spark api (not sure of Stream)
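For the batch (non-streaming) Spark API, the wildcard approach would look roughly like this (the path is illustrative, and `sc` is assumed to be an existing SparkContext):

```scala
// Glob pattern matches all files directly under sparkinput/;
// sc is an already-constructed SparkContext.
val files = sc.textFile("hdfs://localhost:8020/user/hdfs/sparkinput/*")
files.take(5).foreach(println)
```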

Tathagata Das

Apr 26, 2013, 5:41:30 PM4/26/13
to spark...@googlegroups.com
The way fileStream works is that, by default, it processes only new files created in the directory you specified. So if you run a Spark Streaming program with fileStream and do not create any new files, you may not see any output on screen.

Here are some of the things that you can try. 

1. Verify that the directory actually exists, and that you are creating files in the directory.
2. Just using fileStream without specifying the generic types K, V, and F may not work. I would strongly suggest you use textFileStream() and generate plain text files to see whether it works.
3. There may be some issue with your HDFS setup (permissions, etc.). Try running the textFileStream with a local filesystem directory and see whether it works.
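A minimal sketch of the textFileStream approach from point 2 (the batch interval, master URL, and paths here are illustrative assumptions, not taken from the thread):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative local setup; adjust master and batch interval as needed.
val conf = new SparkConf().setMaster("local[2]").setAppName("FileStreamTest")
val ssc = new StreamingContext(conf, Seconds(10))

// textFileStream avoids the K, V, F type parameters that the
// generic fileStream requires, and reads files as plain text.
val lines = ssc.textFileStream("hdfs://localhost:8020/user/hdfs/sparkinput/")
lines.print()

ssc.start()
ssc.awaitTermination()
```

While this is running, move or copy new files into the directory (e.g. `hadoop fs -put local.txt /user/hdfs/sparkinput/`); only files that appear after the stream starts are picked up, which is why pre-existing files produce no output.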

Keep me updated on what you find. 

TD


--
You received this message because you are subscribed to the Google Groups "Spark Users" group.
