I am using mongo-hadoop to load data from MongoDB into my Spark job as an RDD.
https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage

This is my current config:
Configuration mongodbConfigRestaurantSetup = new Configuration();
// Read from MongoDB via mongo-hadoop's InputFormat
mongodbConfigRestaurantSetup.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
// Connection string for the input collection
mongodbConfigRestaurantSetup.set("mongo.input.uri", props.getProperty("mongo_uri"));
// Max split size in MB
mongodbConfigRestaurantSetup.set("mongo.input.split_size", "200");
// Only load documents whose MyId is in listOfIds
mongodbConfigRestaurantSetup.set("mongo.input.query", "{\"MyId\":{\"$in\":[" + listOfIds + "]}}");
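For reference, I then create the RDD from this configuration the same way the wiki shows (a minimal sketch; sc is my JavaSparkContext):

import org.apache.spark.api.java.JavaPairRDD;
import org.bson.BSONObject;
import com.mongodb.hadoop.MongoInputFormat;

JavaPairRDD<Object, BSONObject> documents = sc.newAPIHadoopRDD(
        mongodbConfigRestaurantSetup,  // the Configuration built above
        MongoInputFormat.class,        // InputFormat that reads from MongoDB
        Object.class,                  // key: the document _id
        BSONObject.class               // value: the document itself
);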
My collection has >10M documents, but my Spark job only needs to work on a subset of them (listOfIds), say 1M IDs, or maybe just a single ID for testing.
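To be concrete, listOfIds is just a comma-separated string of quoted IDs that gets spliced into the $in array above. A sketch with made-up IDs (in the real job the list comes from elsewhere):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative values only; the real list can have up to ~1M entries.
List<String> ids = Arrays.asList("id1", "id2", "id3");
String listOfIds = ids.stream()
        .map(id -> "\"" + id + "\"")        // quote each ID for JSON
        .collect(Collectors.joining(","));  // "id1","id2","id3"
// Resulting query: {"MyId":{"$in":["id1","id2","id3"]}}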
But when I load the data, mongo-hadoop appears to load all the documents first and only then apply the query to that dataset, which is very inefficient.
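Even a query for a single ID shows this behavior (a sketch reusing the config, imports, and RDD creation from above; the ID value is made up):

// Query for one (hypothetical) ID: the result set is tiny...
mongodbConfigRestaurantSetup.set("mongo.input.query", "{\"MyId\": \"id1\"}");
JavaPairRDD<Object, BSONObject> rdd = sc.newAPIHadoopRDD(
        mongodbConfigRestaurantSetup, MongoInputFormat.class,
        Object.class, BSONObject.class);
// ...but count() still takes as long as scanning the full >10M documents,
// which suggests the query is applied after the documents are loaded.
System.out.println(rdd.count()); // 1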
Is it a technical limitation or is there a suggested workaround for this?
Also, it looks like someone else had a similar issue:
http://codeforhire.com/2014/02/18/using-spark-with-mongodb/comment-page-1/#comment-853

Thanks,
-Utkarsh