matrix streaming algorithm in Spark


richard...@gmail.com

Dec 20, 2013, 11:10:34 AM
to spark...@googlegroups.com
Hi,

I am a newbie to Spark and want to implement some matrix streaming algorithms in it. Here is a basic description. Suppose the matrix is stored as plain text in HDFS and is so large that RAM cannot hold the entire matrix. Each time, I only want to load one line of the matrix (i.e. a row) and process it (e.g. apply an FFT). So we touch the matrix only once (one pass) while using a very small amount of RAM. How can I implement this simple framework in PySpark (or the Scala shell)? Thanks.


Richard

Reynold Xin

Dec 20, 2013, 1:44:07 PM
to spark...@googlegroups.com
You can definitely stream it.

You just need to do 

val rdd = sc.textFile("hdfs://....")
rdd.map { line =>
  // do whatever you want
}

As long as you don't call "cache" or "persist" on the RDD, the data gets streamed through using an iterator, so only a small number of rows are in memory at any time.
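Since the original question mentioned PySpark, the equivalent there would be `sc.textFile("hdfs://...").map(process_row)`, with `process_row` a hypothetical per-row function. The iterator-based, one-pass behavior described above can be sketched in plain Python without Spark (all names here are illustrative, and the row sum stands in for a real transform such as an FFT):

```python
import io

def stream_rows(lines):
    """Lazily parse each text line into a row of floats.

    Only one row is materialized at a time, mirroring how an
    uncached RDD streams partitions through an iterator.
    """
    for line in lines:
        yield [float(x) for x in line.split()]

def process_row(row):
    """Placeholder per-row transform (an FFT would go here)."""
    return sum(row)

# Simulate a large matrix stored as plain text, one row per line.
matrix_text = io.StringIO("1 2 3\n4 5 6\n7 8 9\n")

# One pass over the data; the full matrix is never held in memory.
results = [process_row(row) for row in stream_rows(matrix_text)]
print(results)  # one value per row
```

In Spark the same discipline applies: keep the per-row function pure and avoid collecting the whole RDD to the driver, so the data stays distributed and streamed.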





--
You received this message because you are subscribed to the Google Groups "Spark Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spark-users...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
