Replica Set -Java driver - Primary Failover


RajAhm

Oct 28, 2010, 5:41:14 PM
to mongodb-user
I have set up a replica set with three mongod servers on localhost.

I am testing failover from the primary, using Java driver version 2.2.

I have written an application to test how the Java driver behaves
during failover, and I am puzzled, to say the least.

As soon as I press CTRL-C to kill the primary mongod server, I see the
following messages in Eclipse:

===
WARNING: node down: 171.69.120.229:27018 java.io.IOException: couldn't
connect to [/171.69.120.229:27018] bc:java.net.ConnectException:
Connection refused: connect
Oct 28, 2010 2:33:38 PM com.mongodb.ReplicaSetStatus$Node update
WARNING: node down: 171.69.120.229:27018 java.io.IOException: couldn't
connect to [/171.69.120.229:27018] bc:java.net.ConnectException:
Connection refused: connect
Oct 28, 2010 2:33:44 PM com.mongodb.ReplicaSetStatus$Node update
WARNING: node down: 171.69.120.229:27018 java.io.IOException: couldn't
connect to [/171.69.120.229:27018] bc:java.net.ConnectException:
Connection refused: connect
Oct 28, 2010 2:33:50 PM com.mongodb.ReplicaSetStatus$Node update
WARNING: node down: 171.69.120.229:27018 java.io.IOException: couldn't
connect to [/171.69.120.229:27018] bc:java.net.ConnectException:
Connection refused: connect
Oct 28, 2010 2:33:56 PM com.mongodb.ReplicaSetStatus$Node update
WARNING: node down: 171.69.120.229:27018 java.io.IOException: couldn't
connect to [/171.69.120.229:27018] bc:java.net.ConnectException:
Connection refused: connect
Oct 28, 2010 2:34:02 PM com.mongodb.ReplicaSetStatus$Node update
WARNING: node down: 171.69.120.229:27018 java.io.IOException: couldn't
connect to [/171.69.120.229:27018] bc:java.net.ConnectException:
Connection refused: connect
Oct 28, 2010 2:34:08 PM com.mongodb.ReplicaSetStatus$Node update
==================================================

Then I try to persist some JSON records and I get the following error:


Oct 28, 2010 2:28:33 PM com.mongodb.DBTCPConnector$MyPort error
SEVERE: MyPort.error called
com.mongodb.MongoInternalException: DBPort.findOne failed
at com.mongodb.DBPort.findOne(DBPort.java:123)
at com.mongodb.DBPort.runCommand(DBPort.java:129)
at com.mongodb.DBTCPConnector._checkWriteError(DBTCPConnector.java:123)
at com.mongodb.DBTCPConnector.say(DBTCPConnector.java:153)
at com.mongodb.DBApiLayer$MyCollection.update(DBApiLayer.java:289)
at com.mongodb.DBCollection.save(DBCollection.java:534)
at com.google.code.morphia.DatastoreImpl.save(DatastoreImpl.java:638)
at com.google.code.morphia.DatastoreImpl.save(DatastoreImpl.java:685)
at com.google.code.morphia.DatastoreImpl.save(DatastoreImpl.java:679)
at com.cisco.cas.paymentgateway.CreditCardProfileThread.run(CreditCardProfileThread.java:22)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.EOFException
at org.bson.io.Bits.readFully(Bits.java:32)
at com.mongodb.Response.<init>(Response.java:34)
at com.mongodb.DBPort.go(DBPort.java:95)
at com.mongodb.DBPort.findOne(DBPort.java:115)
... 10 more

I believed Java driver 2.2 would figure out which node is the new
primary, but that is not what I am observing.

Once I get an exception, I also wait 20 seconds for the replica set to
elect a new primary.
I am not sure how the Java driver and the application are supposed to
behave once the primary fails, and I don't think it is clearly
documented with a use case. This is a very typical deployment: the
primary fails, a secondary takes over, and the application logic should
be oblivious to the internal failure/recovery mechanism.

Please enlighten me.

Thanks
Rajan Bhatt

Eliot Horowitz

Oct 28, 2010, 9:46:09 PM
to mongod...@googlegroups.com
The first write after the master fails might fail.
The driver tries to pre-empt that, but maybe could be better.
Can you try with the new 2.3 version?


Robert Stewart

Oct 29, 2010, 3:08:34 AM
to mongodb-user
I was seeing the same problems that Rajan reported with the 2.2 driver
earlier today. The 2.3 driver is much better. In fact, it works
exactly as expected.

I've got an app that is using Log4mongo-java to store log events in a
collection. I started the app with Log4mongo-java pointed at my three-
member replica set running on localhost with one of the members set to
arbiterOnly.

While the app was logging to primary, I killed primary. Secondary
elected itself to be the new primary after a delay of less than two
seconds.

My app logged the following to the console:

log4j:ERROR Failed to insert document to MongoDB
com.mongodb.MongoException$Network: can't say something
at com.mongodb.DBTCPConnector.say(DBTCPConnector.java:169)
...
Caused by: java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
...

I then used the mongo shell to connect to the new primary and
confirmed that it missed only the statements that were logged in the
short period between me killing the first instance and the second
instance taking over as primary.

Dan Harvey

Oct 29, 2010, 10:49:21 AM
to mongodb-user
I am seeing the same thing too. I was using the 2.2 driver, but after
switching to the 2.3 driver things don't seem to have changed.

I am connecting to a 3 server replica set with :-

MongoOptions mo = new MongoOptions();
mo.autoConnectRetry = true;

Mongo mongo = new Mongo(Arrays.asList(
        new ServerAddress("data-node-1"),
        new ServerAddress("data-node-2"),
        new ServerAddress("data-node-3")), mo);
DB db = mongo.getDB("dm");

DBCollection docs = db.getCollection("documents");

Then have a loop of id's I'm grabbing to benchmark the setup :-

for (String id : ids) {
    // Do stats on response time
    DBObject key = new BasicDBObject();
    key.put("id", id);

    // Benchmark this
    DBCursor results = docs.find(key);
}

When I kill the primary node on the cluster, one of the other two
servers takes over as primary, but I end up with the following errors
from the driver while in the loop :-

29-Oct-2010 15:37:37 com.mongodb.ReplicaSetStatus$Node update
WARNING: node down: data-node-1:27017 java.net.SocketTimeoutException:
Read timed out
29-Oct-2010 15:37:55 com.mongodb.ReplicaSetStatus$Node update
WARNING: node down: data-node-1:27017 java.io.IOException: couldn't
connect to [data-node-1/192.168.2.146:27017]
bc:java.net.NoRouteToHostException: No route to host
29-Oct-2010 15:38:06 com.mongodb.ReplicaSetStatus$Node update
WARNING: node down: data-node-1:27017 java.io.IOException: couldn't
connect to [data-node-1/192.168.2.146:27017]
bc:java.net.NoRouteToHostException: No route to host
29-Oct-2010 15:38:17 com.mongodb.ReplicaSetStatus$Node update
(and so on..)

Do I need to connect in a different way to allow the driver to
automatically use the new Primary server?

Thanks,

Eliot Horowitz

Oct 29, 2010, 11:02:14 AM
to mongod...@googlegroups.com
You're connecting correctly.
You'll see those messages for now - it's a background thread checking things.
Does your app work though?

Dan Harvey

Oct 29, 2010, 1:30:17 PM
to mongodb-user
No, the app doesn't work. After adding a print statement in the loop to
see what results are returned, I find that nothing more is output once
the primary node has switched over.
I've just run it again and after 5 minutes it is still just
outputting :-

29-Oct-2010 18:28:34 com.mongodb.ReplicaSetStatus$Node update
WARNING: node down: data-node-1:27017 java.io.IOException: couldn't
connect to [data-node-1/192.168.2.146:27017]
bc:java.net.NoRouteToHostException: No route to host

This repeats over and over, so it seems like it's not finding the new primary server?

Eliot Horowitz

Oct 29, 2010, 1:31:42 PM
to mongod...@googlegroups.com
Can you send the code you're using to test?
We have a similar test that works correctly.

Those messages don't mean it's not finding the master, just that it's
found a down node.

Dan

Oct 29, 2010, 1:36:29 PM
to mongod...@googlegroups.com
Yes, sure. I've put the code up as a gist here: http://gist.github.com/653968

Eliot Horowitz

Oct 29, 2010, 1:39:55 PM
to mongod...@googlegroups.com
The first request after the master dies will fail.
So if you put a try/catch inside the for loop, you should get a couple
failures while a new master is elected, then should start working.
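
The suggestion above can be sketched roughly as follows. This is not the code from the gist: the `Lookup` interface and `flakyLookup` helper are hypothetical stand-ins for a driver call such as `docs.find(key)`, so that the example runs with the JDK alone. It only illustrates the shape of the loop, where a failed request is caught and the loop carries on instead of aborting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: keep a query loop alive across a failover by catching per-request failures.
class FailoverLoopSketch {

    /** Hypothetical stand-in for a driver call such as docs.find(key). */
    interface Lookup {
        String find(String id);
    }

    /** Simulates a primary that dies: the next `failures` calls throw, later calls succeed. */
    static Lookup flakyLookup(final int failures) {
        return new Lookup() {
            private int remaining = failures;

            public String find(String id) {
                if (remaining > 0) {
                    remaining--;
                    throw new RuntimeException("simulated network error");
                }
                return "doc-" + id;
            }
        };
    }

    /** Runs every lookup; a failed request is recorded and the loop moves on. */
    static List<String> runAll(List<String> ids, Lookup lookup) {
        List<String> outcomes = new ArrayList<String>();
        for (String id : ids) {
            try {
                outcomes.add(lookup.find(id));
            } catch (RuntimeException e) {
                // Expected for the first request(s) after the primary goes away.
                outcomes.add("failed:" + id);
            }
        }
        return outcomes;
    }

    public static void main(String[] args) {
        // prints [failed:a, failed:b, doc-c, doc-d]
        System.out.println(runAll(Arrays.asList("a", "b", "c", "d"), flakyLookup(2)));
    }
}
```

With a real driver the catch block would typically log the exception and perhaps pause briefly before the next request, so the loop does not spin while the election is in progress.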

Dan

Oct 29, 2010, 3:32:54 PM
to mongod...@googlegroups.com
I've added a try/catch for all runtime exceptions around the inner loop (which I think is what the driver throws?), then tried the primary node failover again.
What happens is that as soon as I kill the node, the queries stop and the driver just keeps logging that it can't connect to the node that failed. It doesn't print any exceptions from find() or from reading data from the DBCursor.

After about 15 minutes it outputs the following exception and starts working again :-

29-Oct-2010 19:56:47 com.mongodb.DBTCPConnector _error
WARNING: replica set mode, switching master
java.net.SocketException: No route to host
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:146)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at org.bson.io.Bits.readFully(Bits.java:35)
at org.bson.io.Bits.readFully(Bits.java:28)
at com.mongodb.Response.<init>(Response.java:35)
at com.mongodb.DBPort.go(DBPort.java:101)
at com.mongodb.DBPort.go(DBPort.java:66)
at com.mongodb.DBPort.call(DBPort.java:56)
at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:211)
at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:284)
at com.mongodb.DBCursor._check(DBCursor.java:297)
at com.mongodb.DBCursor._hasNext(DBCursor.java:420)
at com.mongodb.DBCursor.hasNext(DBCursor.java:445)
at com.mendeley.catalog.serving.test.ReadBenchmark.main(ReadBenchmark.java:55)
29-Oct-2010 19:56:47 com.mongodb.DBTCPConnector$MyPort error
SEVERE: MyPort.error called
java.net.SocketException: No route to host
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:146)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at org.bson.io.Bits.readFully(Bits.java:35)
at org.bson.io.Bits.readFully(Bits.java:28)
at com.mongodb.Response.<init>(Response.java:35)
at com.mongodb.DBPort.go(DBPort.java:101)
at com.mongodb.DBPort.go(DBPort.java:66)
at com.mongodb.DBPort.call(DBPort.java:56)
at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:211)
at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:284)
at com.mongodb.DBCursor._check(DBCursor.java:297)
at com.mongodb.DBCursor._hasNext(DBCursor.java:420)
at com.mongodb.DBCursor.hasNext(DBCursor.java:445)
at com.mendeley.catalog.serving.test.ReadBenchmark.main(ReadBenchmark.java:55)

So I guess it's just trying a few too many times to reconnect to the previous primary before it fails over? According to the mongod HTTP interface, the new primary is elected within 30s.

Where do you set the number of retries or a timeout for failing over to the new primary?

Thanks,

Eliot Horowitz

Oct 29, 2010, 3:34:46 PM
to mongod...@googlegroups.com
It should do it at the first failure.
Can you send the current version of the code - will give it a try.

Dan

Oct 30, 2010, 8:49:15 AM
to mongod...@googlegroups.com
The latest code with the try catch is here :- http://gist.github.com/655263

I think it is failing over on the first failure, but taking 15 minutes to actually fail the find()?

Thanks for the fast replies!

Eliot Horowitz

Oct 31, 2010, 8:53:50 PM
to mongod...@googlegroups.com
How are you killing the master?

Dan

Nov 1, 2010, 5:58:28 AM
to mongod...@googlegroups.com
I'm running ifconfig eth0 down on that node, so to the others the machine appears to have gone down. I guess another way would be to kill -9 the daemon, but I don't see why there should be much difference between the two from the point of view of the rest of the system?

I've got three Debian instances set up in VirtualBox on my machine, each running MongoDB as part of the replica set.

Eliot Horowitz

Nov 1, 2010, 8:46:03 AM
to mongod...@googlegroups.com
The difference is that when an interface goes down, the socket may not fail quickly; on a kill -9 it would. You can set a socket timeout if you know all your queries should be fast.
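
To illustrate why a socket timeout matters here, the following is a minimal sketch using only `java.net`, not the MongoDB driver. A local listener that never sends a byte stands in for a host whose network went away mid-request (that analogy is an assumption of the sketch); the client read only fails once `SO_TIMEOUT` expires. In the 2.x Java driver the corresponding setting is, as far as I know, the `socketTimeout` field on `MongoOptions` (milliseconds, with 0 meaning no timeout), but check that against your driver version.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Sketch: a read on a silent peer only fails once SO_TIMEOUT expires.
class SocketTimeoutSketch {

    /** Returns true if a read from a peer that never answers times out after timeoutMs. */
    static boolean readTimesOut(int timeoutMs) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // The listener never calls accept() and never sends data; the TCP handshake
        // still completes through the listen backlog, so the client connects fine.
        try (ServerSocket silentPeer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, silentPeer.getLocalPort())) {
            client.setSoTimeout(timeoutMs); // 0 would mean: block indefinitely
            try {
                client.getInputStream().read();
                return false; // the peer answered or closed, which this sketch never does
            } catch (SocketTimeoutException e) {
                return true;  // the blocked read was abandoned after timeoutMs
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

Without the `setSoTimeout` call the read would simply block, which matches the long stall seen when the interface is taken down instead of the process being killed.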

Dan

Nov 1, 2010, 9:19:57 AM
to mongod...@googlegroups.com
Ah OK, kill -9 works as expected, failing over after 15 seconds. But wouldn't the same socket time-out problem occur if, for example, the machine failed completely (power failure), taking the network down like this? So should the socket time-out be set a lot lower to cope with failures like this?

If that is the case, would it be possible to include a heartbeat in the client? Maybe putting a "ping" in the protocol while either the client or the server is blocked waiting for a reply?

I've also noticed that if you take the network down rather than kill -9 the daemon, failover on the replica set nodes takes longer (about 30s rather than < 1s), and the secondary nodes end up in the "recovering" state more often. Does MongoDB's failure detection work better by design when sockets are closed rather than left hanging?

Thanks, 

Eliot Horowitz

Nov 1, 2010, 11:27:08 AM
to mongod...@googlegroups.com
There is a heartbeat, but it can't really interrupt an existing request.
As soon as the heartbeat detects a problem, it will re-route.

As for failover time, it's the same issue. It takes longer to detect
network issues than hard failures.

Dan

Nov 1, 2010, 11:50:05 AM
to mongod...@googlegroups.com
I think for us it's fine to set the socket timeout lower, so I'll do that to help us with those types of failure.

I'm impressed with the speed of replies to this mailing list too, so thanks for helping to speed debugging this up!

RajAhm

Nov 3, 2010, 6:10:27 PM
to mongodb-user
Does the Morphia Datastore object need to be re-created after a master
failover?

ds = morphia.createDatastore(m, mProps.getProperty("mongoDBDBName"));

Where ds is the Morphia Datastore object and m is the Mongo instance.

In my design I have two asynchronous threads running. One thread waits
on a blocking queue to save objects annotated with Morphia annotations,
and the other thread uses BasicDBObject directly for persistence. The
BasicDBObject thread is able to save after a master failover, whereas
the Morphia-annotated one is not able to save even after failover.
Should I create a new Datastore object after a master failover, or
should Morphia take care of it?

Rajan


Scott Hernandez

Nov 3, 2010, 8:40:24 PM
to mongod...@googlegroups.com
On Wed, Nov 3, 2010 at 3:10 PM, RajAhm <rajan...@sbcglobal.net> wrote:
> Does Morphia DataStore Object needs to re-created after Master
> failover ?

No, it uses the underlying Mongo instance, hence the Datastore
construction below.

> ds = morphia.createDatastore( m,
> mProps.getProperty("mongoDBDBName") );
>
> Where ds Morphia Datastore object and m is Mongo instance.
>
> In my design I have two asynchronous threads running. One thread which
> is waiting on Blocking queue to save Object which is annotated with
> Morphia annotations and another thread directly use BasicDBObject for
> persistence. This BasicDBObject is effectively able to save after
> Master failover where Morphia annotated is not able to save even after
> failover.

Are you checking for write errors? By default the datastore will make
all writes "Safe" so that exceptions can be raised. This is not the
default for the driver writes.

> Should I create new datastore object after  Master failover or Morphia
> should take care of it ?

No, unless you are creating a new Mongo instance.

RajAhm

Nov 4, 2010, 1:08:11 PM
to mongodb-user
Thanks, Scott, for the reply.

I am still running into some issues.
This is my sample code. In a nutshell, this thread blocks on a queue,
waiting for new data to be inserted into Mongo.
===========================================
public class CreditCardProfileThread implements Runnable {

    @Override
    public void run() {
        CreateCustomerProfile ccProf = null;
        // TODO Auto-generated method stub
        System.out.println(" I am in Create Credit Card Profile runnable interface ");
        while (true) {
            try {
                System.out.println(" I am blocking on queue for Create Credit Card Profile");
                ccProf = CreditCardInfoImpl.CreditCardProfilequeue.take();

                System.out.println(" I am just unblocked and saving data - Credit Card Profile ");

                CreditCardInfoImpl.ds.save(ccProf);

            } catch (InterruptedException e) {
                // TODO Auto-generated catch block
                System.out.println(" I am here1");
                // e.printStackTrace();
            } catch (Exception ie) {

                System.out.println(" I am here2");
                System.out.println(" Exception caught : " + ie.getMessage());
                try {
                    System.out.println(" Sleeping starts ");
                    Thread.sleep(20000);
                    System.out.println(" wakeup thread ");
                    CreditCardInfoImpl.createNewMorphiaDataStore();
                } catch (InterruptedException e) {
                    // TODO Auto-generated catch block
                    System.out.println(" I am interrupted Create Card Profile Thread ");
                    // e.printStackTrace();
                }

                // Let us drain all elements to Collection of type Vector while we were down
                Vector<CreateCustomerProfile> vct = new Vector<CreateCustomerProfile>();
                CreditCardInfoImpl.CreditCardProfilequeue.drainTo(vct);
                // Add the element which we already removed from queue
                vct.add(ccProf);
                for (CreateCustomerProfile c : vct) {
                    System.out.println(" Saving a Create profile ");
                    CreditCardInfoImpl.ds.save(c);
                }

            }

        }
========
If the thread gets data, it wakes up and stores it (this is the good
case, where all replica set members are running).

When I kill (CTRL-C) the primary member, I get an exception, and at that
point the election of a new primary starts. It may take, say, 10
seconds. I sleep for 20 seconds and then try to save the data again, and
then I get the following exception. It seems that this thread dies when
it tries to save after the 20-second sleep.


Nov 4, 2010 9:47:03 AM com.mongodb.DBTCPConnector$MyPort error
SEVERE: MyPort.error called
com.mongodb.MongoInternalException: DBPort.findOne failed
at com.mongodb.DBPort.findOne(DBPort.java:129)
at com.mongodb.DBPort.runCommand(DBPort.java:135)
at com.mongodb.DBTCPConnector._checkWriteError(DBTCPConnector.java:121)
at com.mongodb.DBTCPConnector.say(DBTCPConnector.java:157)
at com.mongodb.DBTCPConnector.say(DBTCPConnector.java:141)
at com.mongodb.DBApiLayer$MyCollection.update(DBApiLayer.java:317)
at com.mongodb.DBCollection.save(DBCollection.java:534)
at com.google.code.morphia.DatastoreImpl.save(DatastoreImpl.java:638)
at com.google.code.morphia.DatastoreImpl.save(DatastoreImpl.java:685)
at com.google.code.morphia.DatastoreImpl.save(DatastoreImpl.java:679)
at com.cisco.cas.paymentgateway.CreditCardProfileThread.run(CreditCardProfileThread.java:22)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.EOFException
at org.bson.io.Bits.readFully(Bits.java:37)
at org.bson.io.Bits.readFully(Bits.java:28)
at com.mongodb.Response.<init>(Response.java:35)
at com.mongodb.DBPort.go(DBPort.java:101)
at com.mongodb.DBPort.go(DBPort.java:66)
at com.mongodb.DBPort.findOne(DBPort.java:121)
... 11 more
I am here2
Exception caught : DBPort.findOne failed
Sleeping starts
Nov 4, 2010 9:47:09 AM com.mongodb.ReplicaSetStatus$Node update
WARNING: node down: 171.69.120.229:27018 java.io.IOException: couldn't
connect to [/171.69.120.229:27018] bc:java.net.ConnectException:
Connection refused: connect
Nov 4, 2010 9:47:16 AM com.mongodb.ReplicaSetStatus$Node update
WARNING: node down: 171.69.120.229:27018 java.io.IOException: couldn't
connect to [/171.69.120.229:27018] bc:java.net.ConnectException:
Connection refused: connect
Nov 4, 2010 9:47:23 AM com.mongodb.ReplicaSetStatus$Node update
WARNING: node down: 171.69.120.229:27018 java.io.IOException: couldn't
connect to [/171.69.120.229:27018] bc:java.net.ConnectException:
Connection refused: connect
wakeup thread
Saving a Create profile
Nov 4, 2010 9:47:24 AM com.mongodb.DBPort _open
SEVERE: going to sleep and retry. total sleep time after = 2000ms
this time:100ms
Nov 4, 2010 9:47:26 AM com.mongodb.DBPort _open
SEVERE: going to sleep and retry. total sleep time after = 4156ms
this time:200ms
Nov 4, 2010 9:47:27 AM com.mongodb.DBPort _open
SEVERE: going to sleep and retry. total sleep time after = 6374ms
this time:400ms
Nov 4, 2010 9:47:28 AM com.mongodb.DBPort _open
SEVERE: going to sleep and retry. total sleep time after = 8782ms
this time:800ms
Nov 4, 2010 9:47:30 AM com.mongodb.DBPort _open
SEVERE: going to sleep and retry. total sleep time after = 12188ms
this time:1600ms
Nov 4, 2010 9:47:30 AM com.mongodb.ReplicaSetStatus$Node update
WARNING: node down: 171.69.120.229:27018 java.io.IOException: couldn't
connect to [/171.69.120.229:27018] bc:java.net.ConnectException:
Connection refused: connect
Nov 4, 2010 9:47:32 AM com.mongodb.DBPort _open
SEVERE: going to sleep and retry. total sleep time after = 17438ms
this time:3200ms
Nov 4, 2010 9:47:36 AM com.mongodb.DBPort _open
SEVERE: going to sleep and retry. total sleep time after = 25874ms
this time:2063ms
Nov 4, 2010 9:47:37 AM com.mongodb.ReplicaSetStatus$Node update
WARNING: node down: 171.69.120.229:27018 java.io.IOException: couldn't
connect to [/171.69.120.229:27018] bc:java.net.ConnectException:
Connection refused: connect
Nov 4, 2010 9:47:39 AM com.mongodb.DBTCPConnector$MyPort error
SEVERE: MyPort.error called
java.io.IOException: couldn't connect to [/171.69.120.229:27018]
bc:java.net.ConnectException: Connection refused: connect
at com.mongodb.DBPort._open(DBPort.java:205)
at com.mongodb.DBPort.go(DBPort.java:85)
at com.mongodb.DBPort.go(DBPort.java:66)
at com.mongodb.DBPort.say(DBPort.java:61)
at com.mongodb.DBTCPConnector.say(DBTCPConnector.java:155)
at com.mongodb.DBTCPConnector.say(DBTCPConnector.java:141)
at com.mongodb.DBApiLayer$MyCollection.update(DBApiLayer.java:317)
at com.mongodb.DBCollection.save(DBCollection.java:534)
at com.google.code.morphia.DatastoreImpl.save(DatastoreImpl.java:638)
at com.google.code.morphia.DatastoreImpl.save(DatastoreImpl.java:685)
at com.google.code.morphia.DatastoreImpl.save(DatastoreImpl.java:679)
at com.cisco.cas.paymentgateway.CreditCardProfileThread.run(CreditCardProfile

=================

}
=======


I am not sure how to solve this puzzle.
I am still looking into it; any insight or suggestion is appreciated.

Rajan


Scott Hernandez

Nov 4, 2010, 3:27:23 PM
to mongod...@googlegroups.com
This behavior is probably just the first operation going to the downed
node failing. You may want to retry in this case in your application
code.

RajAhm

Nov 4, 2010, 5:13:56 PM
to mongodb-user
I thought 20 seconds would be enough time for the switchover. That is
why I sleep for 20 seconds and then try to save again, along with any
new updates I may have received during those 20 seconds.
In my case, I am not catching an exception after the 20-second
interval; the assumption is that I have already waited long enough and
the operation will now succeed.
What do you think?

Scott Hernandez

Nov 4, 2010, 5:43:17 PM
to mongod...@googlegroups.com
I would suggest catching the exception and retrying. I haven't done
extensive testing with replica set fail-over to see in which cases an
exception isn't caught, or required.

I believe the 2.3 driver does a better job of it though.
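
A minimal sketch of the catch-and-retry approach suggested above; it is not code from this thread. The class name RetryOnFailover, the helper withRetry, and its parameters are hypothetical. The lambda in main stands in for a driver call such as a save and simulates two failed attempts before succeeding. The doubling delay loosely mirrors the sleep times visible in the driver log (200ms, 400ms, 800ms, ...).

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of the MongoDB Java driver.
final class RetryOnFailover {

    /**
     * Runs op and retries it when it throws, up to maxAttempts in total.
     * The pause between attempts starts at initialDelayMs and doubles
     * after every failure. The last exception is rethrown if all
     * attempts fail.
     */
    static <T> T withRetry(Callable<T> op, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread has been asked to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        String result = withRetry(() -> {
            // Stand-in for the real write; fails on the first two calls.
            if (++calls[0] < 3) {
                throw new IOException("couldn't connect");
            }
            return "saved";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In an application, the body of the lambda would be the actual write, and the attempt count and initial delay would be chosen so that the total wait covers the expected election time. Whether a failed write is safe to repeat depends on the operation, so this is only a starting point.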

RajAhm

Nov 4, 2010, 6:36:43 PM
to mongodb-user
I am using the Java 2.3 driver. I think something is not right. I can
see that a new primary gets elected within 2 seconds, but the Java
driver somehow does not handle the change as expected. I can retry
after getting an exception, but a wait of 20 seconds should be long
enough, I believe.

What do you think?
Rajan


RajAhm

Nov 4, 2010, 7:23:02 PM
to mongodb-user
Some observations.

I am seeing the following errors on the console of one of the secondaries:

Wed Mar 10 09:08:39 [rs_sync] replSet SECONDARY
Wed Mar 10 09:08:51 [conn3] update exception 10054 not master 0ms
Wed Mar 10 09:10:23 [conn2] insert exception 10058 not master 0ms
Wed Mar 10 09:36:28 [conn2] insert exception 10058 not master 0ms

Why is the MongoDB Java driver still trying to insert into a secondary
after failover?

Any ideas?


Eliot Horowitz

Nov 4, 2010, 8:14:33 PM
to mongod...@googlegroups.com
In 1.6, if you're not using getLastError, it will take up to 20 seconds to fail over.
If the server is 1.7, it will be almost instantaneous.

Joseph Wang

Nov 4, 2010, 9:09:13 PM
to mongod...@googlegroups.com
I'm using the 2.3 Java driver against 1.7.2 servers. It seems to take more than
20 secs.
The insertion seems to hang. No additional documents are added after killing the
primary with kill -9.

Nov 4, 2010 8:58:38 PM com.mongodb.ReplicaSetStatus$Node update
WARNING: node down: ip-10-166-59-166:20000 java.net.SocketException: Connection
reset
Nov 4, 2010 8:58:38 PM com.mongodb.DBTCPConnector$MyPort error
SEVERE: MyPort.error called
com.mongodb.MongoInternalException: DBPort.findOne failed
at com.mongodb.DBPort.findOne(DBPort.java:129)
at com.mongodb.DBPort.runCommand(DBPort.java:135)
at com.mongodb.DBTCPConnector._checkWriteError(DBTCPConnector.java:121)
at com.mongodb.DBTCPConnector.say(DBTCPConnector.java:157)
at com.mongodb.DBTCPConnector.say(DBTCPConnector.java:141)
at com.mongodb.DBApiLayer$MyCollection.insert(DBApiLayer.java:241)
at com.mongodb.DBApiLayer$MyCollection.insert(DBApiLayer.java:197)
at com.leadpoint.db.DbInsertListWorker.run(DbInsertListWorker.java:52)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.EOFException
at org.bson.io.Bits.readFully(Bits.java:37)
at org.bson.io.Bits.readFully(Bits.java:28)
at com.mongodb.Response.<init>(Response.java:35)
at com.mongodb.DBPort.go(DBPort.java:101)
at com.mongodb.DBPort.go(DBPort.java:66)
at com.mongodb.DBPort.findOne(DBPort.java:121)
... 10 more
DbInsertListWorker: exception com.mongodb.MongoInternalException: DBPort.findOne failed
...
WARNING: node down: ip-10-166-59-166:20000 java.io.IOException: couldn't connect
to [ip-10-166-59-166/10.166.59.166:20000] bc:java.net.ConnectException:
Connection refused


Code:
try {
    if (mongo != null) {
        DB db = mongo.getDB(dbName);
        if (db != null) {
            db.requestStart();
            DBCollection collection = db.getCollection(collectionName);
            if (collection != null) {
                db.resetError();
                collection.insert(list.toArray(new DBObject[list.size()]),
                        WriteConcern.REPLICAS_SAFE);
                DBObject dbObject = db.getPreviousError();
                if (dbObject != null && dbObject.get("err") != null) {
                    System.out.println("Info DbInsertListWorker preverror:"
                            + dbObject.toString());
                }
                dbObject = db.getLastError(WriteConcern.REPLICAS_SAFE);
                if (dbObject != null && dbObject.get("err") != null) {
                    System.out.println("Info DbInsertListWorker lasterror:"
                            + dbObject.toString());
                }
            } else {
                System.out.println("Info DbInsertListWorker: empty collection");
            }
            db.requestDone();
        } else {
            System.out.println("Info DbInsertListWorker: empty db");
        }
    } else {
        System.out.println("Info DbInsertListWorker: empty mongo");
    }
} catch (Exception ex) {
    System.out.println("DbInsertListWorker: exception " + ex);
    ex.printStackTrace();
}

Joseph Wang

Nov 5, 2010, 5:22:07 PM
to mongod...@googlegroups.com
Killed the primary while running the insert. Even though the primary switched
from ip-10-166-59-166:20000 to ip-10-166-57-74:20000, the driver is still
connected to ip-10-166-59-166:20000. Is there a way to force com.mongodb.Mongo
instance to switch?


mongo.debugString() showed:
DBTCPConnector: replica set : [ip-10-166-59-166:20000,
ip-10-166-57-74:20000]

mongo.getConnectPoint() showed:
ip-10-166-59-166:20000
