mongo-hadoop - Usage of mongo.input.query for loading data from mongo to hadoop

utkar...@gmail.com

unread,

Sep 14, 2015, 4:33:58 PM9/14/15

to mongodb-user

I am using mongo-hadoop to load data from monogo to my spark job an RDD.
https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage

This my my current config:
        Configuration mongodbConfigRestaurantSetup = new Configuration();
        mongodbConfigRestaurantSetup.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
        mongodbConfigRestaurantSetup.set("mongo.input.uri", props.getProperty("mongo_uri"));
        mongodbConfigRestaurantSetup.set("mongo.input.split_size", "200");
        mongodbConfigRestaurantSetup.set("mongo.input.query", "{\"MyId\":{\"$in\":[" + listOfIds + "]}}");

My collections has >10M documents (listOfIds) but my spark jobs needs to work on a subset of that data, say 1M IDs or maybe just 1 for testing.
But when I load the data, mongo-hadoop loads all the documents and then applies the query on that dataset, which is not very efficient.

Is it a technical limitation or is there a suggested workaround for this?
Also, looks like some else had a similar issue: http://codeforhire.com/2014/02/18/using-spark-with-mongodb/comment-page-1/#comment-853

Thanks,
-Utkarsh

Luke Lovett

unread,

Sep 15, 2015, 1:13:05 PM9/15/15

to mongodb-user

This query is run on MongoDB, not Spark. However, the query is run after calculating the splits (Spark partitions). There isn't a way to filter documents from a collection before the splitting happens. This comment from the same page explains this issue a bit.

The inefficient part of this process isn't that all documents are loaded into memory (they shouldn't be); it's that some of the splits/partitions may be empty, because the query can return no results for some splits. There's possibly also some extra work being done as part of the splitting process. This issue is currently being addressed for the next release of the connector and is tracked in this ticket: https://jira.mongodb.org/browse/HADOOP-83.

I think the next comment on that page has a good recommendation-- if you just want a few documents for testing your Spark application, just copy a small subset of the documents from your real collection to a new "test" collection.

utkar...@gmail.com

unread,

Sep 15, 2015, 1:47:56 PM9/15/15

to mongodb-user

Is there a programmatic way to create a temp collection?
Something like: "CREATE TABLE test AS SELECT * FROM myExistingTable" in SQL?

Luke Lovett

unread,

Sep 15, 2015, 2:11:24 PM9/15/15

to mongodb-user

There are a number of ways to do this in the mongo shell or any driver. For example, using the aggregation framework:

db.collection.aggregate([{$match: <query>}, {$limit: <sample-size>}, {$out: <temp-collection-name>}])

Utkarsh Sengar

unread,

Sep 15, 2015, 7:57:12 PM9/15/15

to mongod...@googlegroups.com

Unfortunately I am running mongo version 2.4.12 and I get this error:

Error("Printing Stack Trace")@:0
()@src/mongo/shell/utils.js:37
([object Array])@src/mongo/shell/collection.js:866
@(shell):1

uncaught exception: aggregate failed: {
    "errmsg" : "exception: Unrecognized pipeline stage name: '$out'",
    "code" : 16436,
    "ok" : 0
}

Suggestions?

--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.

For other MongoDB technical support options, see: http://www.mongodb.org/about/support/.
---
You received this message because you are subscribed to a topic in the Google Groups "mongodb-user" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/mongodb-user/xBuXHwZhtTY/unsubscribe.
To unsubscribe from this group and all its topics, send an email to mongodb-user...@googlegroups.com.
To post to this group, send email to mongod...@googlegroups.com.
Visit this group at http://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/4c88fd97-fade-4b6b-905b-40a275bc097e%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.