Examples with SparkContext.hadoopRDD


cearl

Mar 20, 2012, 6:49:43 AM
to Spark Users
Hi,
I was wondering if you had any examples of using non-standard
InputFormats with SparkContext.hadoopRDD?
Thanks

Markus Stenberg

Mar 20, 2012, 8:29:14 AM
to spark...@googlegroups.com
Here's an anonymized example from some code I've been working on:

  import org.apache.hadoop.conf.Configuration
  import org.bson.BSONObject
  import com.mongodb.hadoop.MongoInputFormat

  def getRawX(sc: SparkContext) = {
    val t = classOf[MongoInputFormat]
    val conf = new Configuration
    conf.set("mongo.input.uri", "mongodb://localhost/X_database.Y_collection")

    // The default 8 MB splits are fairly small, as we're dealing
    // with a 20+ GB database. Let's try 200 MB chunks instead.
    conf.setInt("mongo.input.split_size", 200)

    // Key is the document _id, value is the BSON document itself.
    new NewHadoopRDD(sc, t,
                     classOf[Object], classOf[BSONObject],
                     conf)
  }

I'm using NewHadoopRDD since Hadoop-Mongo uses the post-0.20 API. (And after this I still need to deal with the BSONObject ugliness, but that's Mongo-specific, I suppose.)
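
For instance, to pull a single field out of each (Object, BSONObject) pair; rough sketch, the "name" key here is just a placeholder:

  // Extract one field per document; "name" is a made-up key,
  // any BSONObject.get key works the same way.
  val names = getRawX(sc).map { case (_, doc) =>
    doc.get("name").asInstanceOf[String]
  }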

Cheers,

-Markus

余雷

Mar 20, 2012, 8:45:52 AM
to spark...@googlegroups.com
Hi, I'm curious: could the newRDD method turn a matrix (a 2-dim array) into an RDD?
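
Something like this is what I have in mind; a rough sketch, not tested:

  // Distribute the rows of a 2-dim array -> RDD[Array[Double]]
  val matrix: Array[Array[Double]] = Array(Array(1.0, 2.0), Array(3.0, 4.0))
  val matrixRDD = sc.parallelize(matrix)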


-----------------------------------------------------------------------------
Look to yourself first for the cause of problems; be strict with yourself and lenient with others.

cearl

Mar 20, 2012, 9:28:43 AM
to Spark Users
Thanks Markus,
So if I understand correctly, NewHadoopRDD is for Hadoop 0.20 and above?
Charles

Markus Stenberg

Mar 21, 2012, 6:49:01 AM
to spark...@googlegroups.com
On Tuesday, 20 March 2012 at 15:28:43 UTC+2, cearl wrote:
Thanks Markus,
So if I understand correctly, NewHadoopRDD is for Hadoop 0.20 and above?

Well, I'm not quite sure where the API change happened; I vaguely remember it being 0.20.something. Hadoop versioning is scary: 1.0 = 0.20.something, yet the branch it's based on dates from something like 2009, that is, ancient.

You can determine it from which mapreduce package the code uses: org.apache.hadoop.mapred is the old (deprecated) API, and org.apache.hadoop.mapreduce is the more recent one. Similarly, look at the Hadoop version you're using; if hadoop.mapreduce doesn't exist, it's an old version that only has the old API ;)
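
For example, the difference is visible from the imports alone; a quick sketch (the path here is just a placeholder):

  // Old "mapred" API -- pairs with SparkContext.hadoopRDD / hadoopFile:
  import org.apache.hadoop.mapred.TextInputFormat
  // New "mapreduce" API -- pairs with NewHadoopRDD:
  import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat}
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.conf.Configuration

  // Both read plain text files; only the InputFormat package differs.
  val oldApi = sc.hadoopFile("hdfs:///some/path", classOf[TextInputFormat],
                             classOf[LongWritable], classOf[Text])
  val newApi = new NewHadoopRDD(sc, classOf[NewTextInputFormat],
                                classOf[LongWritable], classOf[Text],
                                new Configuration)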

Cheers,

-Markus 