Examples with SparkContext.hadoopRDD


cearl

Mar 20, 2012, 6:49:43 AM
to Spark Users
Hi,
I was wondering if you had any examples of using non-standard
InputFormats with SparkContext.hadoopRDD?
Thanks

Markus Stenberg

Mar 20, 2012, 8:29:14 AM
to spark...@googlegroups.com
Here's an anonymized example from some code I've been working on:

  import org.apache.hadoop.conf.Configuration
  import org.bson.BSONObject
  import com.mongodb.hadoop.MongoInputFormat

  def getRawX(sc: SparkContext) = {
    val t = classOf[MongoInputFormat]
    val conf = new Configuration
    conf.set("mongo.input.uri", "mongodb://localhost/X_database.Y_collection")

    // The default 8 MB splits are fairly small, as we're dealing
    // with a 20+ GB database. Let's try 200 MB chunks instead.
    conf.setInt("mongo.input.split_size", 200)

    // Key is the document _id, value is the BSON document itself.
    new NewHadoopRDD(sc, t,
                     classOf[Object], classOf[BSONObject],
                     conf)
  }

I'm using NewHadoopRDD since Hadoop-Mongo uses the post-0.20 API. (And after this I still need to deal with the BSONObject ugliness, but that's Mongo-specific, I suppose.)
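
For instance, to pull a single field out of each (Object, BSONObject) pair; rough sketch, the "name" key here is just a placeholder:

  // Extract one field per document; "name" is a made-up key,
  // any BSONObject.get key works the same way.
  val names = getRawX(sc).map { case (_, doc) =>
    doc.get("name").asInstanceOf[String]
  }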

Cheers,

-Markus

余雷

Mar 20, 2012, 8:45:52 AM
to spark...@googlegroups.com
Hi, I'm curious: could the newRDD method turn a matrix (a 2-dim array) into an RDD?
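
Something like this is what I have in mind; a rough sketch, not tested:

  // Distribute the rows of a 2-dim array -> RDD[Array[Double]]
  val matrix: Array[Array[Double]] = Array(Array(1.0, 2.0), Array(3.0, 4.0))
  val matrixRDD = sc.parallelize(matrix)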


-----------------------------------------------------------------------------
Look to yourself first for the cause of problems; be strict with yourself and lenient with others.

cearl

Mar 20, 2012, 9:28:43 AM
to Spark Users
Thanks Markus,
So if I understand correctly, NewHadoopRDD is for Hadoop 0.20 and above?
Charles

Markus Stenberg

Mar 21, 2012, 6:49:01 AM
to spark...@googlegroups.com
On Tuesday, 20 March 2012 at 15:28:43 UTC+2, cearl wrote:
Thanks Markus,
So if I understand correctly, NewHadoopRDD is for Hadoop 0.20 and above?

Well, I'm not quite sure where the API change happened; I vaguely remember it being 0.20.something. Hadoop versioning is scary: 1.0 = 0.20.something, yet the branch it's based on dates from something like 2009, that is, ancient.

You can determine it from which mapreduce package the code uses: org.apache.hadoop.mapred is the old (deprecated) API, and org.apache.hadoop.mapreduce is the more recent one. Similarly, look at the Hadoop version you're using; if hadoop.mapreduce doesn't exist, it's an old version that only has the old API ;)
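
For example, the difference is visible from the imports alone; a quick sketch (the path here is just a placeholder):

  // Old "mapred" API -- pairs with SparkContext.hadoopRDD / hadoopFile:
  import org.apache.hadoop.mapred.TextInputFormat
  // New "mapreduce" API -- pairs with NewHadoopRDD:
  import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat}
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.conf.Configuration

  // Both read plain text files; only the InputFormat package differs.
  val oldApi = sc.hadoopFile("hdfs:///some/path", classOf[TextInputFormat],
                             classOf[LongWritable], classOf[Text])
  val newApi = new NewHadoopRDD(sc, classOf[NewTextInputFormat],
                                classOf[LongWritable], classOf[Text],
                                new Configuration)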

Cheers,

-Markus 