PHP MongoDB errors when using automatic failover.

418 views
Skip to first unread message

tres

unread,
May 11, 2011, 5:25:41 AM5/11/11
to mongodb-user
I'm having the hardest time trying to figure this one out. I have a 2
server replica set with one arbiter running alongside the current
primary.

When I connect or issue queries, I get intermittent errors which range
from any one of the following:

- Could not determine master.
- Could not connect to [put host here].
- Broken pipe exceptions.
- MongoCursorExceptions.

It seems like they happen more often the longer the servers are up.
After awhile, no connections can be made and I get the following error
when I try and login to the mongo shell:

MongoDB shell version: 1.8.1
connecting to: test
Wed May 11 16:36:50 MessagingPort recv() errno:104 Connection reset by
peer 127.0.0.1:27017
Wed May 11 16:36:50 SocketException: remote: error: 9001 socket
exception [1]
Wed May 11 16:36:50 DBClientCursor::init call() failed
exception: DBClientBase::findOne: transport error: 127.0.0.1 query:
{ whatsmyuri: 1 }

(BTW, with verbose (-vvvv) logging, the 9001 socket exception seems to
come up with every query that is run.)

When I restart the mongo servers, things come back up and start
working, but not for long.

No matter what combination of connection options I used, nothing
worked. When I changed both webservers to connect to the primary
replica everything started working. No errors. No automatic failover :
(. Weird thing was, that the replicaSet option was still set which
means the PHP driver still needs to determine whether or not the one
it connected to was a master and if not connect to it, if so, do
nothing.

This tells me that the issue resides in the PHP driver where it tries
to connect to the master after attempting to determine if it is
connected to the master or not. Just to try and back up my theory, I
told one of the servers to connect to the secondary server.
Immediately, I started getting errors from that server. Not the other
server which was still connected to the primary. Switch back, no
errors.

Can anyone shed any light on this?

My setup:
- 2 webservers
- Apache/PHP on both
- MongoDB Replica-Set span across both
- Secondary on one
- Primary and Arbiter on the other
- MongoDB 1.8.1
- PHP Driver 1.1.4
- PHP 5.3.3
- Apache 2.2.17
- RedHat 5.5 (Tikanga)

For more information see:
http://stackoverflow.com/questions/5961951/php-mongodb-errors-when-using-automatic-failover

k

unread,
May 11, 2011, 10:02:26 AM5/11/11
to mongodb-user
What are the errors you're getting from the secondary? Can you look
at the primary and secondary logs and see if you see errors there when
you try to connect?

Also, can you try the (unreleased) 1.2.0 code (https://github.com/
mongodb/mongo-php-driver)? The connection code should work better.
> For more information see:http://stackoverflow.com/questions/5961951/php-mongodb-errors-when-us...

tres

unread,
May 11, 2011, 9:06:32 PM5/11/11
to mongodb-user
Without verbose logging, all I see is "connection accepted" and "ended
connection" messages. I turned on -vvvv logging and recorded response
times for when an exception was thrown. I then compare the server time
that I get from the response to the server time in the mongo logs.

Server 1: 192.168.1.37
Server 2: 192.168.1.38

> rs.conf()
{
"_id" : "exampleset",
"version" : 3,
"members" : [
{
"_id" : 0,
"host" : "192.168.1.37:27017"
},
{
"_id" : 1,
"host" : "192.168.1.38:27017"
},
{
"_id" : 2,
"host" : "192.168.1.38:27117",
"arbiterOnly" : true
}
]
}

Example response from web server on server 2:
{"error":"MongoCursorException","message":"couldn't get response
header","time":"Thu May 12 08:07:09","serverName":"server2"}

Example response from mongo server on the same server at the same
time:
Thu May 12 08:07:09 [conn4] query: admin.$cmd{ replSetHeartbeat:
"chat", v: 3, pv: 1, checkEmpty: false, from: "192.168.1.37:27017" }
Thu May 12 08:07:09 [conn4] run command admin.$cmd { replSetHeartbeat:
"chat", v: 3, pv: 1, checkEmpty: false, from: "192.168.1.37:27017" }
Thu May 12 08:07:09 [conn4] query admin.$cmd ntoreturn:1 command:
{ replSetHeartbeat: "chat", v: 3, pv: 1, checkEmpty: false, from:
"192.168.1.37:27017" } reslen:132 0ms

If I go back a few seconds earlier I can find a socket exception:
Thu May 12 08:07:04 [initandlisten] connection accepted from
192.168.1.38:60838 #20
Thu May 12 08:07:04 [conn20] query: admin.$cmd{ ismaster: 1 }
Thu May 12 08:07:04 [conn20] run command admin.$cmd { ismaster: 1 }
Thu May 12 08:07:04 [conn20] query admin.$cmd ntoreturn:1 command:
{ ismaster: 1 } reslen:222 0ms
Thu May 12 08:07:04 [conn20] MessagingPort recv() conn closed?
192.168.1.38:60838
Thu May 12 08:07:04 [conn20] SocketException: remote: error: 9001
socket exception [0]
Thu May 12 08:07:04 [conn20] end connection 192.168.1.38:60838

The next exceptions were on server 2:
{"error":"MongoCursorException","message":"couldn't get response
header","time":"Thu May 12 08:07:19","serverName":"server2"}
{"error":"MongoCursorException","message":"couldn't get response
header","time":"Thu May 12 08:07:35","serverName":"server2"}

Then awhile later on server 1:
{"error":"MongoCursorException","message":"couldn't get response
header","time":"Thu May 12 08:24:36","serverName":"server1"}

At this point, they started happening every few requests. The logs
don't match up to the exception times but I am still seeing
intermittent 9001 socket exceptions related to the one above which
probably doesn't have much to do with the exceptions returned as the
times are too far off.

The very odd thing with these errors is that when the servers are both
pointed to the primary, everything is fine. To get these errors, all I
did was point server 1 to the secondary running on server 1. Server 2
was still connecting to the primary running on server 2. However,
errors started happening on both. If I switch back to both using the
primary, errors happen until I restart Apache. Then everything is
fine. No exceptions in responses. If I restart Apache while one is
connected to the secondary, I still get errors.

I don't have to restart the replica set at all unless it gets to the
point to where I can't login to the shell which can happen if I leave
one connected to the secondary long enough.

No matter what I do, I always see the same 9001 socket exception in
the mongo logs.

Throughout all of these tests my Mongo connection instance had the
following configuration:

array(
'persist' => 'pconn',
'timeout' => 60,
'replicaSet' => true
)

But here's the kicker: If I remove the persist option, everything
works!

I'm thinking this is an issue with the way the PHP driver manages
persistent connections between instances when using different
connection strings and managing which connection belongs to which
server (primary/secondary etc.).

I haven't tried the 1.2 code yet. Is there a fix related to this in
that code? We are using this in a production environment, so if it
does fix it, I would want to be using a stable release. However, I
really don't want to have to make a choice between automatic failover
and persistent connections :(

Thank you guys for being so responsive. It's awesome to see so much
support behind an open source project.

-Trey

tres

unread,
May 12, 2011, 2:54:07 AM5/12/11
to mongodb-user
I upgraded to 1.2.0. No dice. Same issues.

Here is my rs.status():
{
"set" : "chat",
"date" : ISODate("2011-05-12T06:51:25Z"),
"myState" : 2,
"members" : [
{
"_id" : 0,
"name" : "192.168.1.37:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"optime" : {
"t" : 1305183060000,
"i" : 432
},
"optimeDate" : ISODate("2011-05-12T06:51:00Z"),
"self" : true
},
{
"_id" : 1,
"name" : "192.168.1.38:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 36,
"optime" : {
"t" : 1305183060000,
"i" : 432
},
"optimeDate" : ISODate("2011-05-12T06:51:00Z"),
"lastHeartbeat" : ISODate("2011-05-12T06:51:23Z")
},
{
"_id" : 2,
"name" : "192.168.1.38:27117",
"health" : 1,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 36,
"optime" : {
"t" : 0,
"i" : 0
},
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("2011-05-12T06:51:23Z")
}
],
"ok" : 1
}


On May 12, 12:02 am, k <krist...@10gen.com> wrote:

Nat

unread,
May 12, 2011, 4:53:29 AM5/12/11
to mongodb-user
Did the problem happen right away or it took a while before it starts
happening? Can you check netstat on both client and server to see
whether you see anything out of ordinary when problem is happening?

k

unread,
May 12, 2011, 2:44:11 PM5/12/11
to mongodb-user
Thanks for all the info. Can you send the exception codes you're
getting, too? E.g., when you catch the MongoCursorException $e, call
$e->getCode()?

tres

unread,
May 12, 2011, 8:10:25 PM5/12/11
to mongodb-user
I went to add an exception code to the output this morning. Just
before I did, I found this:

https://jira.mongodb.org/browse/PHP-89

I was connecting with sockets so I decided to change it to localhost/
ip combo instead. After doing that, things now seem stable even when
using persistence and replicaSet together. So maybe the bug actually
exists when using sockets.

You guys may still want to look a bit more into this. I'll re-post the
exception codes if I start getting errors again.

Thanks for all your help.

-Trey

tres

unread,
May 12, 2011, 8:37:12 PM5/12/11
to mongodb-user
Was running some performance tests on the app and saw a few. Here are
the details:

MongoCursorException: Couldn't determine master, Code: 14

Fortunately, that is the only error that seems to be coming up so far.

tres

unread,
May 12, 2011, 8:41:38 PM5/12/11
to mongodb-user
And now I found a:

MongoCursorException: Couldn't get response header, Code: 4

On May 13, 4:44 am, k <krist...@10gen.com> wrote:

Nat

unread,
May 12, 2011, 9:04:16 PM5/12/11
to mongod...@googlegroups.com
tres,

Did you manage to check connection count from netstat both client and server side?
--
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com.
To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.

k

unread,
May 17, 2011, 12:30:04 PM5/17/11
to mongodb-user
@tres: you probably will get some errors on failover, especially an
error like "couldn't determine master." It should go away once a new
master is up. Same with the "couldn't get response header": replica
set members close all connections on failover. As long as they go
away eventually (i.e., a few seconds to a minute), that is "works as
designed."

I'll take a look at the unix socket vs. ip issues... that definitely
isn't "works as designed"!

tres

unread,
May 19, 2011, 7:43:31 PM5/19/11
to mongodb-user
@k, Problem is, it isn't failing over. It would if I triggered it, but
I get the errors with normal operation.

@Nat, I checked netstat, nothing out of the ordinary. Noticed that if
there is a defunct apache process or a segfault while doing some
operations, that connection is never closed, however.

Did a test with a C# app and all was fine. PHP app had errors. Was
using base drivers, no wrappers.

I've got a 2h consult today so we'll see what we can gather from that.

Thanks, guys and I'll post anymore findings.

-Trey

On May 18, 2:30 am, k <krist...@10gen.com> wrote:
> @tres: you probably will get someerrorsonfailover, especially an
> error like "couldn't determine master."  It should go away once a new
> master is up. Same with the "couldn't get response header": replica
> set members close all connections onfailover.  As long as they go
> > > > > > What are theerrorsyou're getting from the secondary?  Can you look
> > > > > > at the primary and secondary logs and see if you seeerrorsthere when
> > > > > > you try to connect?
>
> > > > > > Also, can you try the (unreleased) 1.2.0 code (https://github.com/
> > > > > >mongodb/mongo-php-driver)?  The connection code should work better.
>
> > > > > > On May 11, 5:25 am, tres <treshug...@gmail.com> wrote:
>
> > > > > > > I'm having the hardest time trying to figure this one out. I have a 2
> > > > > > > server replica set with one arbiter running alongside the current
> > > > > > > primary.
>
> > > > > > > When I connect or issue queries, I get intermittenterrorswhich range
> > > > > > > from any one of the following:
>
> > > > > > > - Could not determine master.
> > > > > > > - Could not connect to [put host here].
> > > > > > > - Broken pipe exceptions.
> > > > > > > - MongoCursorExceptions.
>
> > > > > > > It seems like they happen more often the longer the servers are up.
> > > > > > > After awhile, no connections can be made and I get the following error
> > > > > > > when I try and login to the mongo shell:
>
> > > > > > >MongoDBshell version: 1.8.1
> > > > > > > connecting to: test
> > > > > > > Wed May 11 16:36:50 MessagingPort recv() errno:104 Connection reset by
> > > > > > > peer 127.0.0.1:27017
> > > > > > > Wed May 11 16:36:50 SocketException: remote:  error: 9001 socket
> > > > > > > exception [1]
> > > > > > > Wed May 11 16:36:50 DBClientCursor::init call() failed
> > > > > > > exception: DBClientBase::findOne: transport error: 127.0.0.1 query:
> > > > > > > { whatsmyuri: 1 }
>
> > > > > > > (BTW, with verbose (-vvvv) logging, the 9001 socket exception seems to
> > > > > > > come up with every query that is run.)
>
> > > > > > > When I restart the mongo servers, things come back up and start
> > > > > > > working, but not for long.
>
> > > > > > > No matter what combination of connection options I used, nothing
> > > > > > > worked. When I changed both webservers to connect to the primary
> > > > > > > replica everything started working. Noerrors. Noautomaticfailover:
> > > > > > > (. Weird thing was, that the replicaSet option was still set which
> > > > > > > means thePHPdriver still needs to determine whether or not the one
> > > > > > > it connected to was a master and if not connect to it, if so, do
> > > > > > > nothing.
>
> > > > > > > This tells me that the issue resides in thePHPdriver where it tries
> > > > > > > to connect to the master after attempting to determine if it is
> > > > > > > connected to the master or not. Just to try and back up my theory, I
> > > > > > > told one of the servers to connect to the secondary server.
> > > > > > > Immediately, I started gettingerrorsfrom that server. Not the other
> > > > > > > server which was still connected to the primary. Switch back, no
> > > > > > >errors.
>
> > > > > > > Can anyone shed any light on this?
>
> > > > > > > My setup:
> > > > > > > - 2 webservers
> > > > > > > - Apache/PHPon both
> > > > > > > -MongoDBReplica-Set span across both
> > > > > > > - Secondary on one
> > > > > > > - Primary and Arbiter on the other
> > > > > > > -MongoDB1.8.1
> > > > > > > -PHPDriver 1.1.4
> > > > > > > -PHP5.3.3
> > > > > > > - Apache 2.2.17
> > > > > > > - RedHat 5.5 (Tikanga)
>
> > > > > > > For more information see:http://stackoverflow.com/questions/5961951/php-mongodb-errors-when-us...
>
> > --
> > You received this message because you are subscribed to the Google Groups "mongodb-user" group.
> > To post to this group, send email tomongo...@googlegroups.com.
> > To unsubscribe from this group, send email tomongodb-use...@googlegroups.com.

tres

unread,
May 19, 2011, 7:51:02 PM5/19/11
to mongodb-user
I'm also wondering if this could be linked to an issue with Apache
under a heavy load as we're using mod_php, not FastCGI. Was thinking
about trying Nginx using PHP-FPM to see if it works better.

-Trey

On May 20, 9:43 am, tres <treshug...@gmail.com> wrote:
> @k, Problem is, it isn't failing over. It would if I triggered it, but
> I get theerrorswith normal operation.
>
> @Nat, I checked netstat, nothing out of the ordinary. Noticed that if
> there is a defunct apache process or a segfault while doing some
> operations, that connection is never closed, however.
>
> Did a test with a C# app and all was fine.PHPapp haderrors. Wasusingbase drivers, no wrappers.

tres

unread,
May 19, 2011, 7:54:01 PM5/19/11
to mongodb-user
Sorry for the spam, but, come to think of it, when I ran tests through
CLI, I didn't have any issues at all unless I triggered a segfault.

By the way, passing "false" to the "persist" option is how I triggered
the segfault. I'm guessing that shouldn't be happening :). Driver
1.2.0.

-Trey

On May 20, 9:51 am, tres <treshug...@gmail.com> wrote:
> I'm also wondering if this could be linked to an issue with Apache
> under a heavy load as we'reusingmod_php, not FastCGI. Was thinking
> about trying NginxusingPHP-FPM to see if it works better.

Nat

unread,
May 19, 2011, 9:35:15 PM5/19/11
to mongod...@googlegroups.com
- When there is a segfault, what error do you see? Can you get a dump?
- Do you experience the same behavior as described in https://jira.mongodb.org/browse/PHP-200?

tres

unread,
May 27, 2011, 1:43:10 AM5/27/11
to mongodb-user
Nat,

Currently, the behavior is not the same. We expect after our testing
to be running it this way so we will find out soon enough.

Kristina,

I'd like to thank you for your help. I've been gradually changing
settings and re-running the performance tests and so far things are
stable. I have yet to turn on replication and then connecting to a
mongos, so I'm keeping my fingers crossed.

Another thing I've done is switch to using Nginx and PHP-FPM, which
possibly could have helped in conjunction to the changes. For scaling,
I'll never go back to Apache.

Btw, thanks for accepting the pull request :)

-Trey
Reply all
Reply to author
Forward
0 new messages