Read Secondary Not Working to Mongodb Shard for Spark MongoInputFormat

Rendy Bambang Junior

Feb 18, 2016, 10:45:20 PM

to mongodb-user

I have a Spark job (PySpark) that runs a query against MongoDB. I intended the query to be sent to a secondary.

My mongo input URI looks like this:

'mongo.input.uri': 'mongodb://'+host+':27017/'+dbName+'.'+collection+'?readPreference=secondary'
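
For illustration, the same URI can be assembled with a small helper so the read preference is not lost in string concatenation. This is only a sketch: the helper name build_input_uri and the host, database, and collection values are hypothetical and not part of mongo-hadoop.

```python
from urllib.parse import urlencode


def build_input_uri(host, db_name, collection, port=27017, **options):
    """Return a mongo.input.uri string; keyword arguments become URI options."""
    uri = "mongodb://{}:{}/{}.{}".format(host, port, db_name, collection)
    if options:
        # e.g. readPreference="secondary" becomes "?readPreference=secondary"
        uri += "?" + urlencode(options)
    return uri


# Hypothetical host and namespace, used only for illustration.
conf = {
    "mongo.input.uri": build_input_uri(
        "mongos.example.com", "mydb", "mycoll", readPreference="secondary"
    )
}
print(conf["mongo.input.uri"])
```

Printing the value before submitting the job is a cheap way to confirm that the option actually ends up in the URI the connector receives.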

As I understand it, passing readPreference=secondary as an option in the mongo input URI is the way to make it read from a secondary (ref: https://github.com/mongodb/mongo-hadoop/wiki/Configuration-Reference).

However, when I run the job, I see a spike in mongo faults in the monitoring for my primary nodes. When I checked the logs on the primary nodes, they confirmed that the query ran against the primary instead of a secondary.

What did I do wrong? Did I put the configuration in the wrong place? Does it only work for a plain replica set, and not for a sharded cluster?

Note: the MongoDB deployment is a sharded cluster running version 2.6.X.

Wan Bachtiar

Feb 26, 2016, 12:15:24 AM

to mongodb-user

when I run the job, I see spike on mongo fault at my primary nodes monitoring. When I checked the log at primary nodes, it is confirmed that query run against primary instead of secondary.

Hi Rendy,

I’ve just tested the readPreference=secondary behaviour and it worked as expected. The test environment:

MongoDB v2.6.11 sharded cluster
mongo-java-driver-3.2.2.jar
mongo-hadoop-core-1.5.0-rc0.jar
spark-1.5.1 on hadoop 2.6

The Scala config settings:

mongoConfig.set("mongo.input.uri", "mongodb://<mongos>:<port>/<dbname>.<collection>?readPreference=secondary")
mongoConfig.set("mongo.input.query", "{'field': {'$gte': 100} }")
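
Since the original job is PySpark, here is a rough Python counterpart of the two settings above. It is a sketch under assumptions: the helper mongo_input_conf and the host and namespace are hypothetical, and the commented-out Spark call presumes a live SparkContext named sc with the mongo-hadoop and Java driver jars on the classpath.

```python
import json


def mongo_input_conf(uri, query=None):
    """Build the Hadoop configuration dict for MongoInputFormat."""
    conf = {"mongo.input.uri": uri}
    if query is not None:
        # The connector expects the query as a JSON string.
        conf["mongo.input.query"] = json.dumps(query)
    return conf


# Hypothetical mongos host and namespace, used only for illustration.
conf = mongo_input_conf(
    "mongodb://mongos.example.com:27017/mydb.mycoll?readPreference=secondary",
    {"field": {"$gte": 100}},
)
print(conf)

# With a running SparkContext `sc`, the dict would be passed along these lines:
# rdd = sc.newAPIHadoopRDD(
#     inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
#     keyClass="org.apache.hadoop.io.Text",
#     valueClass="org.apache.hadoop.io.MapWritable",
#     conf=conf,
# )
```

Serialising the query with json.dumps avoids hand-quoting mistakes in the query string.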

To view the query operations, db.setProfilingLevel() was set to 2 so that all operations are recorded. See Profiling levels for more info.

With the readPreference=secondary option specified in the MongoDB URI, the query below returns no results on the primary:

db.system.profile.find({ns:"dbname.collection"}).pretty()

On the secondary, however, the same query returns the profiled query operation. Note the $readPreference mode:

{
    "op" : "query",
    "ns" : "dbname.collection",
    "query" : {
        "$min" : {
            "_id" : ObjectId("56cfb5ce7d9d2ff96a8dc820")
        },
        "$max" : {
        },
        "$orderby" : {
        },
        "$readPreference" : {
            "mode" : "secondary"
        },
        "$query" : {
            "field" : {
                "$gte" : 100
            }
        }
    },
   ...

When the readPreference=secondary option is removed from the MongoDB URI, running the same profile query shows the opposite: results show up on the primary, but not on the secondary. Below is the output from the primary; note the missing $readPreference mode.

{
    "op" : "query",
    "ns" : "dbname.collection",
    "query" : {
        "$min" : {
        },
        "$max" : {
            "_id" : ObjectId("56cfb5ce7d9d2ff96a8dc820")
        },
        "$orderby" : {
        },
        "$query" : {
            "field" : {
                "$gte" : 100
            }
        }
    },
    ...
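
The comparison of the two profiler entries above can also be done programmatically. The sketch below assumes the profiler documents have the shape shown in this thread (MongoDB 2.6 format); the helper name read_preference_mode is hypothetical, and the sample dicts are trimmed copies of the outputs above.

```python
def read_preference_mode(profile_entry):
    """Return the $readPreference mode recorded in a profiler entry, or None."""
    query = profile_entry.get("query", {})
    return query.get("$readPreference", {}).get("mode")


# Trimmed versions of the two system.profile documents shown above.
with_pref = {
    "op": "query",
    "ns": "dbname.collection",
    "query": {
        "$readPreference": {"mode": "secondary"},
        "$query": {"field": {"$gte": 100}},
    },
}
without_pref = {
    "op": "query",
    "ns": "dbname.collection",
    "query": {"$query": {"field": {"$gte": 100}}},
}

print(read_preference_mode(with_pref))     # secondary
print(read_preference_mode(without_pref))  # None

# Against a live member, the entries could come from pymongo, for example:
# for entry in client["dbname"]["system.profile"].find({"ns": "dbname.collection"}):
#     print(read_preference_mode(entry))
```

Running this against each replica set member directly, rather than through mongos, shows which member actually served the query.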

The spike of activity on your primary nodes may be related to the collStats call if you are using an early 3.0.x version of mongo-java-driver. In those versions the getStats() helper does not respect the DBCollection read preference; the issue was resolved in v3.0.4. See JAVA-1921 and HADOOP-220 for more info.

If you are still having difficulty with read preference on secondary, could you provide the following:

The version of the mongo-java-driver jar.
The version of the mongo-hadoop-core jar.
The method you used to confirm that the queries ran on the primary, along with an example of the query.
Whether the input collection is sharded.

You may also find hadoop connector commands useful.

Kind regards,

Wan.

Zong Chang

Feb 28, 2018, 1:05:42 PM

to mongodb-user

Hi Wan,

I encountered the same problem when I tried to use mongo-hadoop to create a Hive table for querying MongoDB data.

My MongoDB server and mongo-java-driver are both version 3.4.3, and my mongo-hadoop-core and mongo-hadoop-hive are both version 2.0.2.

My MongoDB connection string is: mongodb://<myhost>:<myport>/<db>.<collection>?readPreference=secondary

When I execute a query to select data, I get an error like:

Failed with exception java.io.IOException:java.io.IOException: com.mongodb.MongoNotPrimaryException: The server is not the primary and did not execute the operation

From what I have found, this bug should have been fixed as of mongo-java-driver 3.0.4, but I still get this error with 3.4.3.

Any hints for this situation?

Many thanks,

Chang

On Friday, February 26, 2016 at 1:15:24 PM UTC+8, Wan Bachtiar wrote:

Wan Bachtiar

Mar 14, 2018, 1:53:06 AM

to mongodb-user

I encountered the same problem when I tried to use mongo-hadoop to create a hive table to query mongodb data.

Hi Chang,

Please open a new thread with the following information:

MongoDB server version
MongoDB deployment topology (standalone, replica set, or sharded cluster)
Apache Hive version
How you are creating the Hive table (i.e. an example command)
What happens if you remove readPreference=secondary from the URI

Regards,
Wan.
