matrix streaming algorithm in Spark


Dec 20, 2013, 11:10:34 AM12/20/13

I am a newbie to Spark and want to implement some matrix streaming algorithms in it. Here is a basic description. Suppose the matrix is stored as plain text in HDFS and is so large that RAM cannot hold the entire matrix. Each time I only want to load one line of the matrix (i.e. a row) and process it (e.g. apply an FFT). That way we touch the matrix only once (one pass) while using a very small amount of RAM. How can I implement this simple framework in PySpark (or the Scala shell)? Thanks.
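The pattern being asked for can be sketched locally in plain Python before porting it to Spark (a hypothetical stand-alone illustration: the file path is whatever holds the matrix, and a row transform stands in for the FFT). A generator yields one parsed row at a time, so only a single row is ever held in memory:

```python
def rows(path):
    """Lazily yield one matrix row (a list of floats) per line of `path`."""
    with open(path) as f:
        for line in f:          # the file object itself is an iterator
            yield [float(x) for x in line.split()]

def one_pass(path, fn):
    """Apply `fn` to each row in a single pass over the file.

    Memory use is O(one row) plus the collected results; if the results
    are also huge, write each one out instead of accumulating them.
    """
    return [fn(row) for row in rows(path)]
```

In Spark the same shape becomes `sc.textFile(...).map(fn)`: `textFile` plays the role of `rows()` and `map` the role of `fn`, with the per-row iteration happening inside each partition.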


Reynold Xin

Dec 20, 2013, 1:44:07 PM12/20/13
Definitely you can stream it.

You just need to do 

val rdd = sc.textFile("hdfs://....").map { line =>
  // do whatever you want with each row here
}

As long as you don't place any "cache" or "persist" calls on the rdd, the data gets streamed through using an iterator.
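For the PySpark half of the original question, the equivalent sketch looks like this (assumptions: a running `SparkContext` named `sc`, as in the pyspark shell; NumPy available on the workers for the FFT; `hdfs:///tmp/fft_out` is a hypothetical output path, and the input path is a placeholder as above):

```python
import numpy as np

def fft_row(line):
    # Parse one whitespace-separated matrix row and transform it.
    return np.fft.fft([float(x) for x in line.split()])

result = sc.textFile("hdfs://....").map(fft_row)

# Write the transformed rows back out rather than collect()-ing them,
# so the driver never has to hold the whole matrix either.
result.map(lambda row: " ".join(str(v) for v in row)) \
      .saveAsTextFile("hdfs:///tmp/fft_out")  # hypothetical output path
```

Because `map` is lazy and nothing is cached, each partition's rows flow through `fft_row` one at a time when the save action runs.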

You received this message because you are subscribed to the Google Groups "Spark Users" group.
