Without verbose logging, all I see is "connection accepted" and "ended
connection" messages. I turned on -vvvv logging and recorded response
times whenever an exception was thrown. I then compared the server time
from the response against the server time in the mongo logs.
Server 1: 192.168.1.37
Server 2: 192.168.1.38
> rs.conf()
{
    "_id" : "exampleset",
    "version" : 3,
    "members" : [
        {
            "_id" : 0,
            "host" : "192.168.1.37:27017"
        },
        {
            "_id" : 1,
            "host" : "192.168.1.38:27017"
        },
        {
            "_id" : 2,
            "host" : "192.168.1.38:27117",
            "arbiterOnly" : true
        }
    ]
}
Example response from web server on server 2:
{"error":"MongoCursorException","message":"couldn't get response
header","time":"Thu May 12 08:07:09","serverName":"server2"}
Corresponding entries in the mongo log on the same server at the same
time:
Thu May 12 08:07:09 [conn4] query: admin.$cmd{ replSetHeartbeat: "chat", v: 3, pv: 1, checkEmpty: false, from: "192.168.1.37:27017" }
Thu May 12 08:07:09 [conn4] run command admin.$cmd { replSetHeartbeat: "chat", v: 3, pv: 1, checkEmpty: false, from: "192.168.1.37:27017" }
Thu May 12 08:07:09 [conn4] query admin.$cmd ntoreturn:1 command: { replSetHeartbeat: "chat", v: 3, pv: 1, checkEmpty: false, from: "192.168.1.37:27017" } reslen:132 0ms
If I go back a few seconds earlier I can find a socket exception:
Thu May 12 08:07:04 [initandlisten] connection accepted from 192.168.1.38:60838 #20
Thu May 12 08:07:04 [conn20] query: admin.$cmd{ ismaster: 1 }
Thu May 12 08:07:04 [conn20] run command admin.$cmd { ismaster: 1 }
Thu May 12 08:07:04 [conn20] query admin.$cmd ntoreturn:1 command: { ismaster: 1 } reslen:222 0ms
Thu May 12 08:07:04 [conn20] MessagingPort recv() conn closed? 192.168.1.38:60838
Thu May 12 08:07:04 [conn20] SocketException: remote: error: 9001 socket exception [0]
Thu May 12 08:07:04 [conn20] end connection 192.168.1.38:60838
The next exceptions were on server 2:
{"error":"MongoCursorException","message":"couldn't get response
header","time":"Thu May 12 08:07:19","serverName":"server2"}
{"error":"MongoCursorException","message":"couldn't get response
header","time":"Thu May 12 08:07:35","serverName":"server2"}
Then, a while later, on server 1:
{"error":"MongoCursorException","message":"couldn't get response
header","time":"Thu May 12 08:24:36","serverName":"server1"}
At this point, they started happening every few requests. The log
entries don't line up with the exception times: I still see
intermittent 9001 socket exceptions like the one above, but they
probably aren't directly related to the exceptions being returned,
since the times are too far off.
The very odd thing about these errors is that when both web servers
point at the primary, everything is fine. To trigger them, all I did
was point server 1 at the secondary running on server 1; server 2 was
still connecting to the primary running on server 2, yet errors
started happening on both. If I switch both back to the primary, the
errors continue until I restart Apache; after that, everything is fine
and no exceptions appear in responses. If I restart Apache while one
server is still connected to the secondary, I keep getting errors.
I don't have to restart the replica set at all unless it gets to the
point where I can't log in to the shell, which can happen if I leave
one server connected to the secondary long enough.
No matter what I do, I always see the same 9001 socket exception in
the mongo logs.
Throughout all of these tests my Mongo connection instance had the
following configuration:
array(
    'persist'    => 'pconn',
    'timeout'    => 60,
    'replicaSet' => true
)
But here's the kicker: If I remove the persist option, everything
works!
I suspect this is an issue with how the PHP driver shares persistent
connections between Mongo instances created with different connection
strings, and with how it tracks which connection belongs to which
server (primary/secondary, etc.).
I haven't tried the 1.2 code yet. Is there a fix related to this in
that code? We are using this in a production environment, so if it
does fix it, I would want to be using a stable release. However, I
really don't want to have to make a choice between automatic failover
and persistent connections :(
Thank you guys for being so responsive. It's awesome to see so much
support behind an open source project.
-Trey