Here's an anonymized example from some code I've been working on:
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.rdd.NewHadoopRDD
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

def getRawX(sc: SparkContext) = {
  val t = classOf[MongoInputFormat]
  val conf = new Configuration
  conf.set("mongo.input.uri", "mongodb://localhost/X_database.Y_collection")
  // The default 8MB splits are fairly small for a 20+GB database,
  // so let's try 200MB chunks instead.
  conf.setInt("mongo.input.split_size", 200)
  new NewHadoopRDD(sc, t,
    classOf[Object], classOf[BSONObject],
    conf)
}
I'm using NewHadoopRDD because Mongo-Hadoop uses the post-0.20 ("new") Hadoop API. (And after this I still need to deal with the BSONObject ugliness, but I suppose that's Mongo-specific.)
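In case it's useful, here's a rough sketch of how I unwrap the BSONObject values afterwards. The Record case class and the "name"/"count" fields are made-up placeholders, since the real schema is anonymized:

import org.bson.BSONObject

// Hypothetical target type; the real field names are anonymized.
case class Record(name: String, count: Int)

def toRecord(doc: BSONObject): Record =
  Record(
    doc.get("name").asInstanceOf[String],           // placeholder field
    doc.get("count").asInstanceOf[Number].intValue  // placeholder field
  )

// Only the BSON value matters here, so the key is ignored.
val records = getRawX(sc).map { case (_, doc) => toRecord(doc) }

(On Spark builds that have it, sc.newAPIHadoopRDD(conf, t, classOf[Object], classOf[BSONObject]) should produce the same pairs without constructing NewHadoopRDD by hand.)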
Cheers,
-Markus