Sharding woes -- All chunks stuck on one shard

Michael Conigliaro

Feb 17, 2011, 1:31:56 PM2/17/11
to mongodb-user
Hey guys,

I just reconfigured my MongoDB cluster for sharding, and after 24
hours, all chunks still live on the same shard. I've read that this
can happen when you don't have enough data, but I'd think ~90 GB
should be more than enough to trigger the balancer...
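As I understand it (an assumption about the 1.6-era heuristic, not something I've confirmed in the docs), data volume alone doesn't trigger migration: the balancer compares per-shard chunk counts and only migrates once the imbalance exceeds a fixed threshold. A minimal sketch in Python:

```python
def balancer_would_migrate(chunks_per_shard, threshold=8):
    """Sketch of the balancer trigger: migrate only when the gap between
    the most- and least-loaded shard exceeds a chunk-count threshold.
    threshold=8 is an assumption about the 1.6-era default."""
    counts = chunks_per_shard.values()
    return max(counts) - min(counts) > threshold

# ~150 chunks all on one shard of three: wildly imbalanced
print(balancer_would_migrate({"1": 0, "2": 148, "3": 0}))  # True
```

So with ~150 chunks on one shard and zero on the others, the balancer should definitely want to move something.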

A little background on my setup: I started out with a single replica
set of 3 nodes (primary/secondary/arbiter) but recently grew our
cluster to three replica sets of the same configuration so we could
shard our data. I also have three config servers, each with a mongos
installed alongside it. A pretty straightforward setup, I think. I'm
using the latest Ubuntu packages (1.6.5) on 10.04 LTS.

By two days ago, I had successfully reconfigured all my application
servers to communicate through the mongos processes rather than
connecting directly to my original replica set. I mostly followed
these instructions:

http://www.mongodb.org/display/DOCS/Upgrading+from+a+Non-Sharded+System

So far so good. But then yesterday morning, I attempted to shard my
first collection, and that's when the problems began:

db.runCommand( { shardcollection : "foo", key : { bar : 1 } } )
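One thing I'm not sure about (this is an inference; the full namespace example.userEventsJournal appears further down in the thread): shardcollection expects the full "db.collection" namespace, and a bare "foo" wouldn't name one. A quick sanity-check sketch in Python:

```python
def is_full_namespace(ns: str) -> bool:
    """shardcollection takes '<db>.<collection>'; both parts must be non-empty."""
    db, dot, coll = ns.partition(".")
    return bool(db) and bool(dot) and bool(coll)

print(is_full_namespace("foo"))                        # False: bare collection name
print(is_full_namespace("example.userEventsJournal"))  # True
```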

I expected this command to return fairly quickly, since my
understanding was that all sharding operations happen in the
background, but it hung for about 10 minutes. Worse, none of my
application servers could read from or write to MongoDB anymore, so I
felt I had no choice but to start killing operations until I found the
one causing the problem. Luckily, the first one I killed seemed to do
the trick. But now I have no idea what state my cluster is in...

If I run db.printShardingStatus(), I see that MongoDB has split my
collection into ~150 chunks, so that seems like a good sign. Also
db.locks.find({_id: 'balancer'}) seems to indicate that the balancer
is working:

{ "_id" : "balancer", "process" : "mongodb-config01:1297892094:1804289383", "state" : 1, "ts" : ObjectId("4d5c6cb8263a87e875a75a4c"), "when" : "Thu Feb 17 2011 00:32:56 GMT+0000 (GMT)", "who" : "mongodb-config01:1297892094:1804289383:Balancer:846930886", "why" : "doing balance round" }
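My reading of that lock document (an assumption from the output, not the docs): state 0 would mean the lock is free, and a non-zero state means some process holds it, so state 1 only shows that a mongos took the balancer lock for a round, not that any chunks actually moved. A small interpreter sketch in Python:

```python
def describe_lock(lock_doc):
    """Interpret a config.locks document: state 0 = free, non-zero = held
    (an assumption inferred from the lock document above)."""
    if lock_doc.get("state", 0) == 0:
        return "free"
    holder = lock_doc.get("who", "unknown")
    return f"held by {holder}: {lock_doc.get('why', '')}"

lock = {"_id": "balancer", "state": 1,
        "who": "mongodb-config01:...:Balancer:...",
        "why": "doing balance round"}
print(describe_lock(lock))
```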

However, after 24 hours, all chunks still live on the same shard.

I've checked mongos.log, and I've never seen any indication that the
balancer is trying to move chunks around. I definitely see errors in
there, but I haven't been able to make sense of them. For example, I
see bursts of messages like this:

Thu Feb 17 09:32:16 MessagingPort say send() errno:32 Broken pipe 10.76.57.113:27018
Thu Feb 17 09:32:16 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:32:16 MessagingPort say send() errno:32 Broken pipe 10.218.53.195:27018
Thu Feb 17 09:32:16 ERROR: couldn't unsert sharding : socket exception

These look like mongos can't communicate with my two new shards.
However, I've never had any trouble manually connecting to the
addresses in these messages from a mongos machine, so I don't
understand what's causing them...

mongo --host 10.76.57.113 --port 27018
MongoDB shell version: 1.6.5
connecting to: 10.76.57.113:27018/test
> show dbs
admin
local
>

Here's another group of messages I see quite often, though I don't
know whether they have anything to do with the sharding problem. The
addresses in these messages are not my MongoDB servers, so I have to
assume these are transient connection errors from my application
servers:

Thu Feb 17 09:32:14 [conn1928] MessagingPort recv() errno:110 Connection timed out 10.124.193.83:47386
Thu Feb 17 09:32:14 [conn1928] SocketException: 9001 socket exception
Thu Feb 17 09:32:14 ERROR: couldn't unset sharding :( { errmsg: "not master", ok: 0.0 }
Thu Feb 17 09:32:14 [conn1907] MessagingPort recv() errno:110 Connection timed out 10.124.193.83:47384
Thu Feb 17 09:32:14 [conn1907] SocketException: 9001 socket exception
Thu Feb 17 09:32:14 ERROR: couldn't unset sharding :( { errmsg: "not master", ok: 0.0 }

Finally, I see messages like this every few hours. I don't know
whether I'm misinterpreting them, but it looks like mongos isn't
finding a master in my new replica sets ("ismaster: false"). Could
this be the problem?

Thu Feb 17 09:25:31 [conn1763] checkmaster: mongodb07.example.com:27018 { setName: "3", ismaster: false, secondary: false, hosts: [ "mongodb07.example.com:27018", "mongodb08.example.com:27018" ], arbiters: [ "mongodb09.example.com:27018" ], ok: 1.0 }
checkmaster: mongodb08.example.com:27018 { setName: "3", ismaster: false, secondary: false, hosts: [ "mongodb08.example.com:27018", "mongodb07.example.com:27018" ], arbiters: [ "mongodb09.example.com:27018" ], ok: 1.0 }
MessagingPort recv() errno:110 Connection timed out 10.124.193.83:38328
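If I'm reading that excerpt right, both data-bearing members of set "3" answer ismaster: false and secondary: false, so mongos never finds a primary to route to, and a migration to that shard couldn't succeed. That check, sketched in Python (hostnames copied from the log above):

```python
def find_primary(ismaster_replies):
    """Return the host claiming to be primary, or None if no member does.
    Mirrors the checkmaster replies above, where neither member of
    set "3" reports ismaster: true."""
    for host, reply in ismaster_replies.items():
        if reply.get("ismaster"):
            return host
    return None

replies = {
    "mongodb07.example.com:27018": {"ismaster": False, "secondary": False},
    "mongodb08.example.com:27018": {"ismaster": False, "secondary": False},
}
print(find_primary(replies))  # None -> mongos can't route to this shard
```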

I've tried lots of things to kick-start the balancer, including
restarting various mongos processes. I've also tried stopping and
restarting the balancer with these commands, but that didn't seem to
do anything:

db.settings.update( { _id: "balancer" }, { $set : { stopped: true } }, true );
db.settings.update( { _id: "balancer" }, { $set : { stopped: false } }, true );

Well, I'm just about out of ideas now, so any advice anyone can give
would be greatly appreciated.

Michael Conigliaro

Feb 17, 2011, 3:40:55 PM2/17/11
to mongodb-user
I just found this thread: http://groups.google.com/group/mongodb-user/browse_thread/thread/fc7102701dbf9610?tvc=2

It links to this JIRA issue: https://jira.mongodb.org/browse/SERVER-2523

Could this be my problem?

- Mike

Eliot Horowitz

Feb 17, 2011, 4:27:29 PM2/17/11
to mongod...@googlegroups.com
Can you send
db.printShardingStatus()
mongos log


Michael Conigliaro

Feb 17, 2011, 5:15:31 PM2/17/11
to mongodb-user
Thanks for the reply, Eliot. Here's the output from
db.printShardingStatus() (server names changed to protect the
innocent =). I'm assuming all the "on : 2" entries mean the data is on
shard 2 (which is my original shard). The fact that I see ~150 chunks
seems like a good sign to me:

> db.printShardingStatus()
--- Sharding Status ---
sharding version: { "_id" : 1, "version" : 3 }
shards:
{
"_id" : "1",
"host" : "1/mongodb01.example.com:27018,mongodb02.example.com:27018"
}
{
"_id" : "2",
"host" : "2/mongodb04.example.com:27018,mongodb05.example.com:27018"
}
{
"_id" : "3",
"host" : "3/mongodb07.example.com:27018,mongodb08.example.com:27018"
}
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "test", "partitioned" : false, "primary" : "2" }
{ "_id" : "example", "partitioned" : true, "primary" : "2" }
example.userEventsJournal chunks:
{ "userId" : { $minKey : 1 } } -->> { "userId" : "01c8a1f0-
c130-5626-d2ef-68de3f8b73da" } on : 2 { "t" : 1000, "i" : 1 }
{ "userId" : "01c8a1f0-c130-5626-d2ef-68de3f8b73da" } -->>
{ "userId" : "038a9a57-1bf5-1582-1433-7ef7d1fe4ef7" } on : 2 { "t" :
1000, "i" : 2 }
{ "userId" : "038a9a57-1bf5-1582-1433-7ef7d1fe4ef7" } -->>
{ "userId" : "054e48f1-202a-2cc1-e73c-ce87663a4969" } on : 2 { "t" :
1000, "i" : 3 }
{ "userId" : "054e48f1-202a-2cc1-e73c-ce87663a4969" } -->>
{ "userId" : "071981a0-2ea1-f928-cbb5-e9c9665a1ecb" } on : 2 { "t" :
1000, "i" : 4 }
{ "userId" : "071981a0-2ea1-f928-cbb5-e9c9665a1ecb" } -->>
{ "userId" : "08cd3676-dcfa-f34d-6402-aa91d68d3855" } on : 2 { "t" :
1000, "i" : 5 }
{ "userId" : "08cd3676-dcfa-f34d-6402-aa91d68d3855" } -->>
{ "userId" : "0a876c3c-926e-0bef-aa82-cd9ab17113df" } on : 2 { "t" :
1000, "i" : 6 }
{ "userId" : "0a876c3c-926e-0bef-aa82-cd9ab17113df" } -->>
{ "userId" : "0c38e3e6-70a4-f3d2-f117-3b24bafdb38c" } on : 2 { "t" :
1000, "i" : 7 }
{ "userId" : "0c38e3e6-70a4-f3d2-f117-3b24bafdb38c" } -->>
{ "userId" : "0dff3343-d9a1-b1ea-f3af-5afffb55bfcc" } on : 2 { "t" :
1000, "i" : 8 }
{ "userId" : "0dff3343-d9a1-b1ea-f3af-5afffb55bfcc" } -->>
{ "userId" : "0fb6afad-6497-c684-6a5a-7df174145456" } on : 2 { "t" :
1000, "i" : 9 }
{ "userId" : "0fb6afad-6497-c684-6a5a-7df174145456" } -->>
{ "userId" : "1164ceb5-1f23-b7f3-f661-4216d47256b7" } on : 2 { "t" :
1000, "i" : 10 }
{ "userId" : "1164ceb5-1f23-b7f3-f661-4216d47256b7" } -->>
{ "userId" : "1321c72a-1bd9-ad02-539c-51bf1b596a9c" } on : 2 { "t" :
1000, "i" : 11 }
{ "userId" : "1321c72a-1bd9-ad02-539c-51bf1b596a9c" } -->>
{ "userId" : "14e0f03c-7e66-ebd4-ee33-30004206a109" } on : 2 { "t" :
1000, "i" : 12 }
{ "userId" : "14e0f03c-7e66-ebd4-ee33-30004206a109" } -->>
{ "userId" : "166365e3-3633-cabd-3eeb-fde6592bea9e" } on : 2 { "t" :
1000, "i" : 13 }
{ "userId" : "166365e3-3633-cabd-3eeb-fde6592bea9e" } -->>
{ "userId" : "1824c8e4-6bcf-1db3-bfe5-dc9c2e6e5dc2" } on : 2 { "t" :
1000, "i" : 14 }
{ "userId" : "1824c8e4-6bcf-1db3-bfe5-dc9c2e6e5dc2" } -->>
{ "userId" : "19f2d70a-f118-22b2-ab94-29efb7fd7e8a" } on : 2 { "t" :
1000, "i" : 15 }
{ "userId" : "19f2d70a-f118-22b2-ab94-29efb7fd7e8a" } -->>
{ "userId" : "1bb4d278-e2d8-0151-9a30-3e6ed218cf0f" } on : 2 { "t" :
1000, "i" : 16 }
{ "userId" : "1bb4d278-e2d8-0151-9a30-3e6ed218cf0f" } -->>
{ "userId" : "1d69477d-b2dc-448f-f38c-cb18563a705b" } on : 2 { "t" :
1000, "i" : 17 }
{ "userId" : "1d69477d-b2dc-448f-f38c-cb18563a705b" } -->>
{ "userId" : "1f1c5033-2bc5-cf07-3e38-2cb52e36e8e4" } on : 2 { "t" :
1000, "i" : 18 }
{ "userId" : "1f1c5033-2bc5-cf07-3e38-2cb52e36e8e4" } -->>
{ "userId" : "20d6214c-f5e2-8790-3ba2-2a974c5e0222" } on : 2 { "t" :
1000, "i" : 19 }
{ "userId" : "20d6214c-f5e2-8790-3ba2-2a974c5e0222" } -->>
{ "userId" : "22a0a54d-301d-6924-7ae8-24772a249d56" } on : 2 { "t" :
1000, "i" : 20 }
{ "userId" : "22a0a54d-301d-6924-7ae8-24772a249d56" } -->>
{ "userId" : "245a6ade-5926-ddc7-105d-a85e453454e2" } on : 2 { "t" :
1000, "i" : 21 }
{ "userId" : "245a6ade-5926-ddc7-105d-a85e453454e2" } -->>
{ "userId" : "26161405-1b89-667c-d864-3ad34a9c7b2b" } on : 2 { "t" :
1000, "i" : 22 }
{ "userId" : "26161405-1b89-667c-d864-3ad34a9c7b2b" } -->>
{ "userId" : "27dd1f01-5de6-f4d3-4f82-466d5f8fc961" } on : 2 { "t" :
1000, "i" : 23 }
{ "userId" : "27dd1f01-5de6-f4d3-4f82-466d5f8fc961" } -->>
{ "userId" : "29929bb4-963f-e504-99d1-2e2d0eaa0f84" } on : 2 { "t" :
1000, "i" : 24 }
{ "userId" : "29929bb4-963f-e504-99d1-2e2d0eaa0f84" } -->>
{ "userId" : "2b619a0b-407e-0de7-42d4-f8a27c77e726" } on : 2 { "t" :
1000, "i" : 25 }
{ "userId" : "2b619a0b-407e-0de7-42d4-f8a27c77e726" } -->>
{ "userId" : "2d2b696e-fa9e-5d4b-e049-e9e22af4d34f" } on : 2 { "t" :
1000, "i" : 26 }
{ "userId" : "2d2b696e-fa9e-5d4b-e049-e9e22af4d34f" } -->>
{ "userId" : "2ef89578-2827-6677-1704-bcc096407beb" } on : 2 { "t" :
1000, "i" : 27 }
{ "userId" : "2ef89578-2827-6677-1704-bcc096407beb" } -->>
{ "userId" : "30b01a5e-cc05-8cec-ffe2-63303f08923c" } on : 2 { "t" :
1000, "i" : 28 }
{ "userId" : "30b01a5e-cc05-8cec-ffe2-63303f08923c" } -->>
{ "userId" : "3249eb0a-d398-163f-376b-c0f16d7b7a46" } on : 2 { "t" :
1000, "i" : 29 }
{ "userId" : "3249eb0a-d398-163f-376b-c0f16d7b7a46" } -->>
{ "userId" : "33fdfaa6-d6ce-ada7-142a-c31d9ee85c80" } on : 2 { "t" :
1000, "i" : 30 }
{ "userId" : "33fdfaa6-d6ce-ada7-142a-c31d9ee85c80" } -->>
{ "userId" : "35a99f3b-1869-24f7-4e0d-467e7a2f9074" } on : 2 { "t" :
1000, "i" : 31 }
{ "userId" : "35a99f3b-1869-24f7-4e0d-467e7a2f9074" } -->>
{ "userId" : "377b5f9d-2b3c-dbbf-faf2-2f14f0e57069" } on : 2 { "t" :
1000, "i" : 32 }
{ "userId" : "377b5f9d-2b3c-dbbf-faf2-2f14f0e57069" } -->>
{ "userId" : "394030ad-d0a4-6df7-1dd5-bcd28e8035d5" } on : 2 { "t" :
1000, "i" : 33 }
{ "userId" : "394030ad-d0a4-6df7-1dd5-bcd28e8035d5" } -->>
{ "userId" : "3aeb5966-4ba4-be43-2982-6e0db9810017" } on : 2 { "t" :
1000, "i" : 34 }
{ "userId" : "3aeb5966-4ba4-be43-2982-6e0db9810017" } -->>
{ "userId" : "3cb14841-2f4a-5a07-866c-f547bd1361a4" } on : 2 { "t" :
1000, "i" : 35 }
{ "userId" : "3cb14841-2f4a-5a07-866c-f547bd1361a4" } -->>
{ "userId" : "3e723478-17e5-7827-1598-a34825789cee" } on : 2 { "t" :
1000, "i" : 36 }
{ "userId" : "3e723478-17e5-7827-1598-a34825789cee" } -->>
{ "userId" : "4034a2d9-ff38-89d4-57ec-1d8c100572b8" } on : 2 { "t" :
1000, "i" : 37 }
{ "userId" : "4034a2d9-ff38-89d4-57ec-1d8c100572b8" } -->>
{ "userId" : "41c8398f-e8e0-4b9a-7af8-b47c8f1e8157" } on : 2 { "t" :
1000, "i" : 38 }
{ "userId" : "41c8398f-e8e0-4b9a-7af8-b47c8f1e8157" } -->>
{ "userId" : "4392ae34-8119-78fa-00c0-abd3410bd4b1" } on : 2 { "t" :
1000, "i" : 39 }
{ "userId" : "4392ae34-8119-78fa-00c0-abd3410bd4b1" } -->>
{ "userId" : "4561975d-beff-01bd-aee2-ad278de95942" } on : 2 { "t" :
1000, "i" : 40 }
{ "userId" : "4561975d-beff-01bd-aee2-ad278de95942" } -->>
{ "userId" : "4711fa4b-cc9c-36a4-c738-da0571c1e128" } on : 2 { "t" :
1000, "i" : 41 }
{ "userId" : "4711fa4b-cc9c-36a4-c738-da0571c1e128" } -->>
{ "userId" : "48cfd7d0-bf16-5eea-5db1-2e5051bfb849" } on : 2 { "t" :
1000, "i" : 42 }
{ "userId" : "48cfd7d0-bf16-5eea-5db1-2e5051bfb849" } -->>
{ "userId" : "4a82a59e-f4d2-1041-a53c-b1b07011c2eb" } on : 2 { "t" :
1000, "i" : 43 }
{ "userId" : "4a82a59e-f4d2-1041-a53c-b1b07011c2eb" } -->>
{ "userId" : "4c38fe9f-0a35-8f5f-43b9-27c1045d3092" } on : 2 { "t" :
1000, "i" : 44 }
{ "userId" : "4c38fe9f-0a35-8f5f-43b9-27c1045d3092" } -->>
{ "userId" : "4df40b1d-0757-0752-3efd-e26ec574f7da" } on : 2 { "t" :
1000, "i" : 45 }
{ "userId" : "4df40b1d-0757-0752-3efd-e26ec574f7da" } -->>
{ "userId" : "4fae9e18-fcb0-1ee5-07d7-a0d9cdeddbae" } on : 2 { "t" :
1000, "i" : 46 }
{ "userId" : "4fae9e18-fcb0-1ee5-07d7-a0d9cdeddbae" } -->>
{ "userId" : "5158026a-e870-8d2e-db5d-47b1a874a016" } on : 2 { "t" :
1000, "i" : 47 }
{ "userId" : "5158026a-e870-8d2e-db5d-47b1a874a016" } -->>
{ "userId" : "532908d0-9ebf-c7e5-2513-21bdcd928436" } on : 2 { "t" :
1000, "i" : 48 }
{ "userId" : "532908d0-9ebf-c7e5-2513-21bdcd928436" } -->>
{ "userId" : "54d66509-08a1-9f3d-e055-e4b9c67d56fa" } on : 2 { "t" :
1000, "i" : 49 }
{ "userId" : "54d66509-08a1-9f3d-e055-e4b9c67d56fa" } -->>
{ "userId" : "5696e06c-01eb-e3a7-e41e-395062b5df3f" } on : 2 { "t" :
1000, "i" : 50 }
{ "userId" : "5696e06c-01eb-e3a7-e41e-395062b5df3f" } -->>
{ "userId" : "585da63a-abab-32a4-df74-c38e4d4d3686" } on : 2 { "t" :
1000, "i" : 51 }
{ "userId" : "585da63a-abab-32a4-df74-c38e4d4d3686" } -->>
{ "userId" : "5a136183-9b0c-6a67-a212-975edd1fbf67" } on : 2 { "t" :
1000, "i" : 52 }
{ "userId" : "5a136183-9b0c-6a67-a212-975edd1fbf67" } -->>
{ "userId" : "5bd2c97c-fade-8bdf-aede-44900529685b" } on : 2 { "t" :
1000, "i" : 53 }
{ "userId" : "5bd2c97c-fade-8bdf-aede-44900529685b" } -->>
{ "userId" : "5d8ea1a9-9317-9fa0-cb99-ebcce9851c90" } on : 2 { "t" :
1000, "i" : 54 }
{ "userId" : "5d8ea1a9-9317-9fa0-cb99-ebcce9851c90" } -->>
{ "userId" : "5f44a55c-a19a-9270-6721-64fc8e4a3da9" } on : 2 { "t" :
1000, "i" : 55 }
{ "userId" : "5f44a55c-a19a-9270-6721-64fc8e4a3da9" } -->>
{ "userId" : "610396f2-e6ac-e02e-bc37-1d76f198549b" } on : 2 { "t" :
1000, "i" : 56 }
{ "userId" : "610396f2-e6ac-e02e-bc37-1d76f198549b" } -->>
{ "userId" : "62c081b4-aa88-baf2-2d74-4e092b55c1e2" } on : 2 { "t" :
1000, "i" : 57 }
{ "userId" : "62c081b4-aa88-baf2-2d74-4e092b55c1e2" } -->>
{ "userId" : "64475277-b2ea-ae40-f36b-4f6813148b8e" } on : 2 { "t" :
1000, "i" : 58 }
{ "userId" : "64475277-b2ea-ae40-f36b-4f6813148b8e" } -->>
{ "userId" : "66149c92-b3ce-0463-61a6-ff73fe50b87e" } on : 2 { "t" :
1000, "i" : 59 }
{ "userId" : "66149c92-b3ce-0463-61a6-ff73fe50b87e" } -->>
{ "userId" : "67d577b6-112f-b007-6fa8-1edad537d725" } on : 2 { "t" :
1000, "i" : 60 }
{ "userId" : "67d577b6-112f-b007-6fa8-1edad537d725" } -->>
{ "userId" : "69ab2cd0-df9f-d5c6-c845-0d51108b5f73" } on : 2 { "t" :
1000, "i" : 61 }
{ "userId" : "69ab2cd0-df9f-d5c6-c845-0d51108b5f73" } -->>
{ "userId" : "6b215f25-b5bd-835f-e247-44ce0adfacee" } on : 2 { "t" :
1000, "i" : 62 }
{ "userId" : "6b215f25-b5bd-835f-e247-44ce0adfacee" } -->>
{ "userId" : "6cd67943-cd7f-7822-183b-7c69d039c99e" } on : 2 { "t" :
1000, "i" : 63 }
{ "userId" : "6cd67943-cd7f-7822-183b-7c69d039c99e" } -->>
{ "userId" : "6eacd4b7-b8c6-7dbe-6b9f-86bb4bef91f3" } on : 2 { "t" :
1000, "i" : 64 }
{ "userId" : "6eacd4b7-b8c6-7dbe-6b9f-86bb4bef91f3" } -->>
{ "userId" : "706ea3ba-9fbc-bfa3-27cd-7f9a05d0b90c" } on : 2 { "t" :
1000, "i" : 65 }
{ "userId" : "706ea3ba-9fbc-bfa3-27cd-7f9a05d0b90c" } -->>
{ "userId" : "72240579-b781-c861-b0d7-9349446fc732" } on : 2 { "t" :
1000, "i" : 66 }
{ "userId" : "72240579-b781-c861-b0d7-9349446fc732" } -->>
{ "userId" : "73e4b9ef-ba5f-9fd8-07ba-3ac6308ac86b" } on : 2 { "t" :
1000, "i" : 67 }
{ "userId" : "73e4b9ef-ba5f-9fd8-07ba-3ac6308ac86b" } -->>
{ "userId" : "75aa50f0-865a-2a23-3b85-030accb1b825" } on : 2 { "t" :
1000, "i" : 68 }
{ "userId" : "75aa50f0-865a-2a23-3b85-030accb1b825" } -->>
{ "userId" : "774e6e48-57d0-3801-32d2-f654b01ace96" } on : 2 { "t" :
1000, "i" : 69 }
{ "userId" : "774e6e48-57d0-3801-32d2-f654b01ace96" } -->>
{ "userId" : "79092857-c64f-9cc8-14c5-22cdd7504586" } on : 2 { "t" :
1000, "i" : 70 }
{ "userId" : "79092857-c64f-9cc8-14c5-22cdd7504586" } -->>
{ "userId" : "7acf8fdb-d75c-0a32-fdcb-b9bab07da363" } on : 2 { "t" :
1000, "i" : 71 }
{ "userId" : "7acf8fdb-d75c-0a32-fdcb-b9bab07da363" } -->>
{ "userId" : "7c834c91-0eed-8c52-021e-2acfa1aec70f" } on : 2 { "t" :
1000, "i" : 72 }
{ "userId" : "7c834c91-0eed-8c52-021e-2acfa1aec70f" } -->>
{ "userId" : "7e39c8b1-9ff3-11f5-9847-76dc199e8ed0" } on : 2 { "t" :
1000, "i" : 73 }
{ "userId" : "7e39c8b1-9ff3-11f5-9847-76dc199e8ed0" } -->>
{ "userId" : "7feb2ed6-69ec-8de4-0b43-f3c874870592" } on : 2 { "t" :
1000, "i" : 74 }
{ "userId" : "7feb2ed6-69ec-8de4-0b43-f3c874870592" } -->>
{ "userId" : "81cbab62-2fe2-a69a-9624-1a50e0717756" } on : 2 { "t" :
1000, "i" : 75 }
{ "userId" : "81cbab62-2fe2-a69a-9624-1a50e0717756" } -->>
{ "userId" : "839ce3e3-d794-96e4-4702-9910772a70c8" } on : 2 { "t" :
1000, "i" : 76 }
{ "userId" : "839ce3e3-d794-96e4-4702-9910772a70c8" } -->>
{ "userId" : "85685dc7-b1e0-29c6-e68d-f765ca7a3798" } on : 2 { "t" :
1000, "i" : 77 }
{ "userId" : "85685dc7-b1e0-29c6-e68d-f765ca7a3798" } -->>
{ "userId" : "873633b8-02be-1d1b-219f-4aa2cf694c2a" } on : 2 { "t" :
1000, "i" : 78 }
{ "userId" : "873633b8-02be-1d1b-219f-4aa2cf694c2a" } -->>
{ "userId" : "8905b1c3-43b0-f863-cd49-e3f3299ecfdb" } on : 2 { "t" :
1000, "i" : 79 }
{ "userId" : "8905b1c3-43b0-f863-cd49-e3f3299ecfdb" } -->>
{ "userId" : "8ac7e6ac-0fe7-ff3a-3f5f-16d19e6c74cf" } on : 2 { "t" :
1000, "i" : 80 }
{ "userId" : "8ac7e6ac-0fe7-ff3a-3f5f-16d19e6c74cf" } -->>
{ "userId" : "8c893a74-88db-72cc-22df-fe39690c6554" } on : 2 { "t" :
1000, "i" : 81 }
{ "userId" : "8c893a74-88db-72cc-22df-fe39690c6554" } -->>
{ "userId" : "8e42def7-1279-ea76-562b-207aa25a95a0" } on : 2 { "t" :
1000, "i" : 82 }
{ "userId" : "8e42def7-1279-ea76-562b-207aa25a95a0" } -->>
{ "userId" : "8fe81cc4-8e15-b555-2a22-559d375d92b4" } on : 2 { "t" :
1000, "i" : 83 }
{ "userId" : "8fe81cc4-8e15-b555-2a22-559d375d92b4" } -->>
{ "userId" : "918001c4-c864-58a3-be3d-16589de90564" } on : 2 { "t" :
1000, "i" : 84 }
{ "userId" : "918001c4-c864-58a3-be3d-16589de90564" } -->>
{ "userId" : "9358f321-6457-14c0-b5aa-8e08751ed866" } on : 2 { "t" :
1000, "i" : 85 }
{ "userId" : "9358f321-6457-14c0-b5aa-8e08751ed866" } -->>
{ "userId" : "95247c82-0225-1d01-fe68-3f90af3171dd" } on : 2 { "t" :
1000, "i" : 86 }
{ "userId" : "95247c82-0225-1d01-fe68-3f90af3171dd" } -->>
{ "userId" : "96ee2cf1-53fd-dc1f-e817-9d9ac71660ed" } on : 2 { "t" :
1000, "i" : 87 }
{ "userId" : "96ee2cf1-53fd-dc1f-e817-9d9ac71660ed" } -->>
{ "userId" : "9896ab94-d3ed-70b4-a4bb-e14f2b57c17b" } on : 2 { "t" :
1000, "i" : 88 }
{ "userId" : "9896ab94-d3ed-70b4-a4bb-e14f2b57c17b" } -->>
{ "userId" : "9a5bedfc-2caf-7fd2-33a1-f4c1590f13b2" } on : 2 { "t" :
1000, "i" : 89 }
{ "userId" : "9a5bedfc-2caf-7fd2-33a1-f4c1590f13b2" } -->>
{ "userId" : "9c25dc2b-d5b7-1441-5738-516563a00a8a" } on : 2 { "t" :
1000, "i" : 90 }
{ "userId" : "9c25dc2b-d5b7-1441-5738-516563a00a8a" } -->>
{ "userId" : "9de4f2c6-d866-54aa-87b1-56a837cad388" } on : 2 { "t" :
1000, "i" : 91 }
{ "userId" : "9de4f2c6-d866-54aa-87b1-56a837cad388" } -->>
{ "userId" : "9fa94068-4d83-a659-0ea9-9773c97c0347" } on : 2 { "t" :
1000, "i" : 92 }
{ "userId" : "9fa94068-4d83-a659-0ea9-9773c97c0347" } -->>
{ "userId" : "a1622497-c32d-c5a8-b2c9-cd6998c2606f" } on : 2 { "t" :
1000, "i" : 93 }
{ "userId" : "a1622497-c32d-c5a8-b2c9-cd6998c2606f" } -->>
{ "userId" : "a336a397-8b64-5cdd-7649-80eeeeccaa99" } on : 2 { "t" :
1000, "i" : 94 }
{ "userId" : "a336a397-8b64-5cdd-7649-80eeeeccaa99" } -->>
{ "userId" : "a4ceee56-5cf2-b29d-7bc1-0b9eb2eb63ee" } on : 2 { "t" :
1000, "i" : 95 }
{ "userId" : "a4ceee56-5cf2-b29d-7bc1-0b9eb2eb63ee" } -->>
{ "userId" : "a66e31d9-f5ce-8c5d-9c1f-bf5b4104689c" } on : 2 { "t" :
1000, "i" : 96 }
{ "userId" : "a66e31d9-f5ce-8c5d-9c1f-bf5b4104689c" } -->>
{ "userId" : "a82c68ab-a8bf-bc1e-0e4f-e7710f3413e5" } on : 2 { "t" :
1000, "i" : 97 }
{ "userId" : "a82c68ab-a8bf-bc1e-0e4f-e7710f3413e5" } -->>
{ "userId" : "a97eaffd-5711-e8ba-409d-6a6d719a7d40" } on : 2 { "t" :
1000, "i" : 98 }
{ "userId" : "a97eaffd-5711-e8ba-409d-6a6d719a7d40" } -->>
{ "userId" : "ab488837-6444-87cb-796d-15b66c4c73a8" } on : 2 { "t" :
1000, "i" : 99 }
{ "userId" : "ab488837-6444-87cb-796d-15b66c4c73a8" } -->>
{ "userId" : "ad10594a-66d5-8b4f-42a7-c5792ce78deb" } on : 2 { "t" :
1000, "i" : 100 }
{ "userId" : "ad10594a-66d5-8b4f-42a7-c5792ce78deb" } -->>
{ "userId" : "aed4aeb9-1b38-548e-ae38-3a0d30d64e53" } on : 2 { "t" :
1000, "i" : 101 }
{ "userId" : "aed4aeb9-1b38-548e-ae38-3a0d30d64e53" } -->>
{ "userId" : "b0a0e585-f9a5-7041-ebed-90aa6bbc0e79" } on : 2 { "t" :
1000, "i" : 102 }
{ "userId" : "b0a0e585-f9a5-7041-ebed-90aa6bbc0e79" } -->>
{ "userId" : "b26fc119-0f59-eeeb-1255-d5918563ed59" } on : 2 { "t" :
1000, "i" : 103 }
{ "userId" : "b26fc119-0f59-eeeb-1255-d5918563ed59" } -->>
{ "userId" : "b443178d-5e85-a1d6-f617-3ab9b0b34a8b" } on : 2 { "t" :
1000, "i" : 104 }
{ "userId" : "b443178d-5e85-a1d6-f617-3ab9b0b34a8b" } -->>
{ "userId" : "b5ff266d-a07c-2a37-1c64-a6fb73502da3" } on : 2 { "t" :
1000, "i" : 105 }
{ "userId" : "b5ff266d-a07c-2a37-1c64-a6fb73502da3" } -->>
{ "userId" : "b7b0713e-ba0c-4ecd-0b08-32bc95552c7b" } on : 2 { "t" :
1000, "i" : 106 }
{ "userId" : "b7b0713e-ba0c-4ecd-0b08-32bc95552c7b" } -->>
{ "userId" : "b959ecb0-51b6-c7ef-0b21-26a00e7f9cd7" } on : 2 { "t" :
1000, "i" : 107 }
{ "userId" : "b959ecb0-51b6-c7ef-0b21-26a00e7f9cd7" } -->>
{ "userId" : "bb214fcd-6e1d-5751-0b17-3075b946159a" } on : 2 { "t" :
1000, "i" : 108 }
{ "userId" : "bb214fcd-6e1d-5751-0b17-3075b946159a" } -->>
{ "userId" : "bcea86d7-6c60-ad94-8d21-16f58aea90c3" } on : 2 { "t" :
1000, "i" : 109 }
{ "userId" : "bcea86d7-6c60-ad94-8d21-16f58aea90c3" } -->>
{ "userId" : "bec3d9ac-fdd6-b289-dc0e-369bdf06d2b6" } on : 2 { "t" :
1000, "i" : 110 }
{ "userId" : "bec3d9ac-fdd6-b289-dc0e-369bdf06d2b6" } -->>
{ "userId" : "c08b5dd0-e6ed-3c20-2b64-2def5f81d5d2" } on : 2 { "t" :
1000, "i" : 111 }
{ "userId" : "c08b5dd0-e6ed-3c20-2b64-2def5f81d5d2" } -->>
{ "userId" : "c25c18bb-8c59-6ba3-7baa-7ff31594aa7d" } on : 2 { "t" :
1000, "i" : 112 }
{ "userId" : "c25c18bb-8c59-6ba3-7baa-7ff31594aa7d" } -->>
{ "userId" : "c4304742-85f9-c514-7a73-1d19725668ba" } on : 2 { "t" :
1000, "i" : 113 }
{ "userId" : "c4304742-85f9-c514-7a73-1d19725668ba" } -->>
{ "userId" : "c5fc6884-4b3a-1ecd-959a-e93139d5f15c" } on : 2 { "t" :
1000, "i" : 114 }
{ "userId" : "c5fc6884-4b3a-1ecd-959a-e93139d5f15c" } -->>
{ "userId" : "c7b958c2-e68e-fb42-8020-ab7575b99f76" } on : 2 { "t" :
1000, "i" : 115 }
{ "userId" : "c7b958c2-e68e-fb42-8020-ab7575b99f76" } -->>
{ "userId" : "c96c9395-bfb4-7b50-c8e0-33cb00d0bba0" } on : 2 { "t" :
1000, "i" : 116 }
{ "userId" : "c96c9395-bfb4-7b50-c8e0-33cb00d0bba0" } -->>
{ "userId" : "cb256f66-fc45-c300-6ae9-8199a95dfaf1" } on : 2 { "t" :
1000, "i" : 117 }
{ "userId" : "cb256f66-fc45-c300-6ae9-8199a95dfaf1" } -->>
{ "userId" : "ccd2a33a-0654-3488-0f5a-098cb31548ff" } on : 2 { "t" :
1000, "i" : 118 }
{ "userId" : "ccd2a33a-0654-3488-0f5a-098cb31548ff" } -->>
{ "userId" : "ce9735d8-0a22-6652-1d59-533942ef12f9" } on : 2 { "t" :
1000, "i" : 119 }
{ "userId" : "ce9735d8-0a22-6652-1d59-533942ef12f9" } -->>
{ "userId" : "d049a5c9-140a-7604-5a07-b741458cc9b3" } on : 2 { "t" :
1000, "i" : 120 }
{ "userId" : "d049a5c9-140a-7604-5a07-b741458cc9b3" } -->>
{ "userId" : "d20089f3-e132-0ec0-4a90-44596d3e3963" } on : 2 { "t" :
1000, "i" : 121 }
{ "userId" : "d20089f3-e132-0ec0-4a90-44596d3e3963" } -->>
{ "userId" : "d3a1f952-8423-2ec9-4db8-061784784688" } on : 2 { "t" :
1000, "i" : 122 }
{ "userId" : "d3a1f952-8423-2ec9-4db8-061784784688" } -->>
{ "userId" : "d55d9257-4f0e-a918-67e6-98892750acf1" } on : 2 { "t" :
1000, "i" : 123 }
{ "userId" : "d55d9257-4f0e-a918-67e6-98892750acf1" } -->>
{ "userId" : "d6fbce6e-794e-24d0-496c-ecf17650ad76" } on : 2 { "t" :
1000, "i" : 124 }
{ "userId" : "d6fbce6e-794e-24d0-496c-ecf17650ad76" } -->>
{ "userId" : "d8c4cdd8-0993-ae30-d346-8376cefa2110" } on : 2 { "t" :
1000, "i" : 125 }
{ "userId" : "d8c4cdd8-0993-ae30-d346-8376cefa2110" } -->>
{ "userId" : "da95426d-bda4-281f-ced2-376eed8635c0" } on : 2 { "t" :
1000, "i" : 126 }
{ "userId" : "da95426d-bda4-281f-ced2-376eed8635c0" } -->>
{ "userId" : "dc642932-b3d1-763c-4779-7cc0e3ee00c2" } on : 2 { "t" :
1000, "i" : 127 }
{ "userId" : "dc642932-b3d1-763c-4779-7cc0e3ee00c2" } -->>
{ "userId" : "de348c7f-8d6f-c1c8-4ad0-1ca0e7d46619" } on : 2 { "t" :
1000, "i" : 128 }
{ "userId" : "de348c7f-8d6f-c1c8-4ad0-1ca0e7d46619" } -->>
{ "userId" : "e00aa360-41cf-b713-846d-9439cb22fb82" } on : 2 { "t" :
1000, "i" : 129 }
{ "userId" : "e00aa360-41cf-b713-846d-9439cb22fb82" } -->>
{ "userId" : "e1d8acc6-2f97-a422-f78f-7ae4cf78d79d" } on : 2 { "t" :
1000, "i" : 130 }
{ "userId" : "e1d8acc6-2f97-a422-f78f-7ae4cf78d79d" } -->>
{ "userId" : "e3796caf-8418-eb27-ad82-cef8c5980018" } on : 2 { "t" :
1000, "i" : 131 }
{ "userId" : "e3796caf-8418-eb27-ad82-cef8c5980018" } -->>
{ "userId" : "e543518f-edea-3cb2-8bef-e9ead0e304ab" } on : 2 { "t" :
1000, "i" : 132 }
{ "userId" : "e543518f-edea-3cb2-8bef-e9ead0e304ab" } -->>
{ "userId" : "e71b38c3-7a59-5e86-69fa-4cdc8204beff" } on : 2 { "t" :
1000, "i" : 133 }
{ "userId" : "e71b38c3-7a59-5e86-69fa-4cdc8204beff" } -->>
{ "userId" : "e8eb6658-23e0-b26e-7a54-f09c127935d4" } on : 2 { "t" :
1000, "i" : 134 }
{ "userId" : "e8eb6658-23e0-b26e-7a54-f09c127935d4" } -->>
{ "userId" : "eabb2ac6-4382-1d15-b717-0e197055a785" } on : 2 { "t" :
1000, "i" : 135 }
{ "userId" : "eabb2ac6-4382-1d15-b717-0e197055a785" } -->>
{ "userId" : "ec6fc430-1898-83c3-3c40-44413744e3a0" } on : 2 { "t" :
1000, "i" : 136 }
{ "userId" : "ec6fc430-1898-83c3-3c40-44413744e3a0" } -->>
{ "userId" : "ee36087d-f7b1-4f1a-88e0-04b1f8b6ecdd" } on : 2 { "t" :
1000, "i" : 137 }
{ "userId" : "ee36087d-f7b1-4f1a-88e0-04b1f8b6ecdd" } -->>
{ "userId" : "efd904e6-abbf-eecb-b237-77fce09ae05d" } on : 2 { "t" :
1000, "i" : 138 }
{ "userId" : "efd904e6-abbf-eecb-b237-77fce09ae05d" } -->>
{ "userId" : "f1ae4566-b032-ac09-2617-38cb22e13210" } on : 2 { "t" :
1000, "i" : 139 }
{ "userId" : "f1ae4566-b032-ac09-2617-38cb22e13210" } -->>
{ "userId" : "f35236ba-9f7f-73c4-7672-76362d79da23" } on : 2 { "t" :
1000, "i" : 140 }
{ "userId" : "f35236ba-9f7f-73c4-7672-76362d79da23" } -->>
{ "userId" : "f5088a70-77d3-fe43-b4c2-9ef253b6d0a2" } on : 2 { "t" :
1000, "i" : 141 }
{ "userId" : "f5088a70-77d3-fe43-b4c2-9ef253b6d0a2" } -->>
{ "userId" : "f6ce73ae-1e61-f0fa-7836-44d5e71dce84" } on : 2 { "t" :
1000, "i" : 142 }
{ "userId" : "f6ce73ae-1e61-f0fa-7836-44d5e71dce84" } -->>
{ "userId" : "f89b2610-b428-3b38-0e28-f68f96d17eba" } on : 2 { "t" :
1000, "i" : 143 }
{ "userId" : "f89b2610-b428-3b38-0e28-f68f96d17eba" } -->>
{ "userId" : "fa603241-b9e6-bbaf-d3a4-78236dbe26b3" } on : 2 { "t" :
1000, "i" : 144 }
{ "userId" : "fa603241-b9e6-bbaf-d3a4-78236dbe26b3" } -->>
{ "userId" : "fc1dfc1c-5f88-b857-0a03-106db7fa5c8a" } on : 2 { "t" :
1000, "i" : 145 }
{ "userId" : "fc1dfc1c-5f88-b857-0a03-106db7fa5c8a" } -->>
{ "userId" : "fdde7459-088e-d394-e7c3-fff22a0dd2ce" } on : 2 { "t" :
1000, "i" : 146 }
{ "userId" : "fdde7459-088e-d394-e7c3-fff22a0dd2ce" } -->>
{ "userId" : "ffae9b95-af68-259e-50c3-0ef5adb80e18" } on : 2 { "t" :
1000, "i" : 147 }
{ "userId" : "ffae9b95-af68-259e-50c3-0ef5adb80e18" } -->>
{ "userId" : { $maxKey : 1 } } on : 2 { "t" : 1000, "i" : 148 }
{ "_id" : "local", "partitioned" : false, "primary" : "1" }
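To make the imbalance concrete: tallying the "on" field of every chunk above gives all ~148 chunks on shard "2" and none on shards "1" or "3". A sketch in Python (the chunk list here is a toy reconstruction of the output, not real data):

```python
from collections import Counter

def chunks_per_shard(chunks):
    """Count chunks by the shard ('on' field) that owns them."""
    return Counter(c["on"] for c in chunks)

# Toy reconstruction: every chunk owned by shard "2"
chunks = [{"on": "2"} for _ in range(148)]
print(chunks_per_shard(chunks))  # Counter({'2': 148})
```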



Michael Conigliaro

Feb 17, 2011, 5:26:15 PM2/17/11
to mongodb-user
I assume you don't want me to post the entire mongos.log (it's huge),
but the last few hours of logs look like this:

Thu Feb 17 22:02:04 [conn991] MessagingPort recv() got 1 bytes wanted len=4
Thu Feb 17 22:03:15 [conn1557] MessagingPort recv() got 1 bytes wanted len=4
Thu Feb 17 22:03:20 [conn665] MessagingPort recv() got 2 bytes wanted len=4
Thu Feb 17 22:04:09 [conn3358] MessagingPort recv() got 3 bytes wanted len=4
Thu Feb 17 22:04:18 [conn603] MessagingPort recv() got 3 bytes wanted len=4
Thu Feb 17 22:05:20 [conn682] MessagingPort recv() got 1 bytes wanted len=4
Thu Feb 17 22:05:26 [conn3332] MessagingPort recv() got 3 bytes wanted len=4
Thu Feb 17 22:05:29 [conn991] MessagingPort recv() got 3 bytes wanted len=4
Thu Feb 17 22:05:30 [conn5989] MessagingPort recv() got 2 bytes wanted len=4
Thu Feb 17 22:05:38 [conn6254] MessagingPort recv() got 3 bytes wanted len=4
Thu Feb 17 22:05:45 [conn658] MessagingPort recv() got 2 bytes wanted len=4
Thu Feb 17 22:06:00 [conn658] MessagingPort recv() got 1 bytes wanted len=4
Thu Feb 17 22:06:10 [conn682] MessagingPort recv() got 2 bytes wanted len=4
Thu Feb 17 22:06:14 [conn1239] MessagingPort recv() got 3 bytes wanted len=4
Thu Feb 17 22:06:46 [conn3330] MessagingPort recv() got 2 bytes wanted len=4
Thu Feb 17 22:07:01 [conn6369] uncaught exception (remotes don't match old [10.110.26.50:55486] new [10.220.179.95:55486]) (mongo::UserException) in PortMessageServer::threadRun, closing connection
Thu Feb 17 22:07:15 [conn6012] MessagingPort recv() got 1 bytes wanted len=4
Thu Feb 17 22:07:54 [conn1371] MessagingPort recv() got 3 bytes wanted len=4
Thu Feb 17 22:08:27 [conn6013] MessagingPort recv() got 1 bytes wanted len=4

If I go back a few hours, I see lots of the errors that I've posted
above:

Thu Feb 17 09:19:46 MessagingPort say send() errno:32 Broken pipe 10.218.53.195:27018
Thu Feb 17 09:19:46 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:19:49 [conn1844] MessagingPort recv() errno:104 Connection reset by peer 10.116.217.156:49818
Thu Feb 17 09:19:49 [conn1844] SocketException: 9001 socket exception
Thu Feb 17 09:19:49 ERROR: couldn't unset sharding :( { errmsg: "not master", ok: 0.0 }
Thu Feb 17 09:25:31 [conn1763] checkmaster: mongodb07.example.com:27018 { setName: "3", ismaster: false, secondary: false, hosts: [ "mongodb07.example.com:27018", "mongodb08.example.com:27018" ], arbiters: [ "mongodb09.example.com:27018" ], ok: 1.0 }
checkmaster: mongodb08.example.com:27018 { setName: "3", ismaster: false, secondary: false, hosts: [ "mongodb08.example.com:27018", "mongodb07.example.com:27018" ], arbiters: [ "mongodb09.example.com:27018" ], ok: 1.0 }
MessagingPort recv() errno:110 Connection timed out 10.124.193.83:38328
Thu Feb 17 09:25:31 [conn1763] SocketException: 9001 socket exception
Thu Feb 17 09:25:31 MessagingPort say send() errno:32 Broken pipe 10.76.57.113:27018
Thu Feb 17 09:25:31 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:25:31 [conn1641] MessagingPort recv() errno:110 Connection timed out 10.124.193.83:57562
Thu Feb 17 09:25:31 [conn1641] SocketException: 9001 socket exception
Thu Feb 17 09:25:31 MessagingPort say send() errno:32 Broken pipe 10.76.57.113:27018
Thu Feb 17 09:25:31 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:25:31 MessagingPort say send() errno:32 Broken pipe 10.218.53.195:27018
Thu Feb 17 09:25:31 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:25:31 [conn1648] MessagingPort recv() errno:110 Connection timed out 10.124.193.83:33232
Thu Feb 17 09:25:31 [conn1648] SocketException: 9001 socket exception
Thu Feb 17 09:25:31 MessagingPort say send() errno:32 Broken pipe 10.76.57.113:27018
Thu Feb 17 09:25:31 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:25:31 MessagingPort say send() errno:32 Broken pipe 10.218.53.195:27018
Thu Feb 17 09:25:31 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:25:31 [conn1734] MessagingPort recv() errno:110 Connection timed out 10.124.193.83:47631
Thu Feb 17 09:25:31 [conn1734] SocketException: 9001 socket exception
Thu Feb 17 09:25:31 MessagingPort say send() errno:32 Broken pipe 10.76.57.113:27018
Thu Feb 17 09:25:31 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:25:31 MessagingPort say send() errno:32 Broken pipe 10.218.53.195:27018
Thu Feb 17 09:25:31 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:25:31 [conn1705] MessagingPort recv() errno:110 Connection timed out 10.124.193.83:47463
Thu Feb 17 09:25:31 [conn1705] SocketException: 9001 socket exception
Thu Feb 17 09:25:31 MessagingPort say send() errno:32 Broken pipe 10.76.57.113:27018
Thu Feb 17 09:25:31 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:25:31 MessagingPort say send() errno:32 Broken pipe 10.218.53.195:27018
Thu Feb 17 09:25:31 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:25:31 [conn1631] MessagingPort recv() errno:110 Connection timed out 10.124.193.83:57433
Thu Feb 17 09:25:31 [conn1631] SocketException: 9001 socket exception
Thu Feb 17 09:25:36 [conn1333] MessagingPort recv() errno:110
Connection timed out 10.124.193.83:46518
Thu Feb 17 09:25:36 [conn1333] SocketException: 9001 socket exception
Thu Feb 17 09:25:36 MessagingPort say send() errno:32 Broken pipe
10.76.57.113:27018
Thu Feb 17 09:25:36 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:25:36 MessagingPort say send() errno:32 Broken pipe
10.218.53.195:27018
Thu Feb 17 09:25:36 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:25:49 [conn1344] MessagingPort recv() errno:110
Connection timed out 10.124.193.83:46537
Thu Feb 17 09:25:49 [conn1344] SocketException: 9001 socket exception
Thu Feb 17 09:25:49 MessagingPort say send() errno:32 Broken pipe
10.76.57.113:27018
Thu Feb 17 09:25:49 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:25:49 MessagingPort say send() errno:32 Broken pipe
10.218.53.195:27018
Thu Feb 17 09:25:49 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:25:56 [conn1936] MessagingPort recv() errno:110
Connection timed out 10.124.193.83:47387
Thu Feb 17 09:25:56 [conn1936] SocketException: 9001 socket exception
Thu Feb 17 09:25:56 ERROR: couldn't unset sharding :( { errmsg: "not
master", ok: 0.0 }
Thu Feb 17 09:29:11 [conn1681] MessagingPort recv() errno:110
Connection timed out 10.124.193.83:33684
Thu Feb 17 09:29:11 [conn1681] SocketException: 9001 socket exception
Thu Feb 17 09:31:16 [conn1359] MessagingPort recv() errno:110
Connection timed out 10.124.193.83:46581
Thu Feb 17 09:31:16 [conn1359] SocketException: 9001 socket exception
Thu Feb 17 09:31:16 MessagingPort say send() errno:32 Broken pipe
10.76.57.113:27018
Thu Feb 17 09:31:16 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:31:16 MessagingPort say send() errno:32 Broken pipe
10.218.53.195:27018
Thu Feb 17 09:31:16 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:31:41 [conn1502] MessagingPort recv() errno:110
Connection timed out 10.124.193.83:51864
Thu Feb 17 09:31:41 [conn1502] SocketException: 9001 socket exception
Thu Feb 17 09:31:41 MessagingPort say send() errno:32 Broken pipe
10.76.57.113:27018
Thu Feb 17 09:31:41 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:32:12 [conn1919] MessagingPort recv() errno:110
Connection timed out 10.124.193.83:47385
Thu Feb 17 09:32:12 [conn1919] SocketException: 9001 socket exception
Thu Feb 17 09:32:12 ERROR: couldn't unset sharding :( { errmsg: "not
master", ok: 0.0 }
Thu Feb 17 09:32:14 [conn1928] MessagingPort recv() errno:110
Connection timed out 10.124.193.83:47386
Thu Feb 17 09:32:14 [conn1928] SocketException: 9001 socket exception
Thu Feb 17 09:32:14 ERROR: couldn't unset sharding :( { errmsg: "not
master", ok: 0.0 }
Thu Feb 17 09:32:14 [conn1907] MessagingPort recv() errno:110
Connection timed out 10.124.193.83:47384
Thu Feb 17 09:32:14 [conn1907] SocketException: 9001 socket exception
Thu Feb 17 09:32:14 ERROR: couldn't unset sharding :( { errmsg: "not
master", ok: 0.0 }
Thu Feb 17 09:32:16 [conn1343] MessagingPort recv() errno:110
Connection timed out 10.124.193.83:46536
Thu Feb 17 09:32:16 [conn1343] SocketException: 9001 socket exception
Thu Feb 17 09:32:16 MessagingPort say send() errno:32 Broken pipe
10.76.57.113:27018
Thu Feb 17 09:32:16 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 09:32:16 MessagingPort say send() errno:32 Broken pipe
10.218.53.195:27018
Thu Feb 17 09:32:16 ERROR: couldn't unsert sharding : socket exception


On Feb 17, 2:27 pm, Eliot Horowitz <eliothorow...@gmail.com> wrote:
> Can you send
> db.printShardingStatus()
> mongos log
>
> On Thu, Feb 17, 2011 at 3:40 PM, Michael Conigliaro
>
> <mi...@socialmedia.com> wrote:
> > I just found this thread:http://groups.google.com/group/mongodb-user/browse_thread/thread/fc71...

Michael Conigliaro

unread,
Feb 17, 2011, 6:44:34 PM2/17/11
to mongodb-user
Just saw another burst of these. Are they related to my issue?
10.76.57.113 and 10.218.53.195 are the masters in my new (empty)
shards.

Thu Feb 17 23:21:15 MessagingPort say send() errno:32 Broken pipe
10.76.57.113:27018
Thu Feb 17 23:21:15 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 23:21:15 [conn1817] MessagingPort recv() errno:104
Connection reset by peer 10.124.70.177:54691
Thu Feb 17 23:21:15 [conn1817] SocketException: 9001 socket exception
Thu Feb 17 23:21:15 MessagingPort say send() errno:32 Broken pipe
10.76.57.113:27018
Thu Feb 17 23:21:15 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 23:21:15 MessagingPort say send() errno:32 Broken pipe
10.76.57.113:27018
Thu Feb 17 23:21:15 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 23:21:15 MessagingPort say send() errno:32 Broken pipe
10.76.57.113:27018
Thu Feb 17 23:21:15 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 23:21:15 MessagingPort say send() errno:32 Broken pipe
10.76.57.113:27018
Thu Feb 17 23:21:15 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 23:21:15 MessagingPort say send() errno:32 Broken pipe
10.218.53.195:27018
Thu Feb 17 23:21:15 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 23:21:15 ERROR: couldn't unset sharding :( { errmsg: "not
master", ok: 0.0 }
Thu Feb 17 23:21:15 MessagingPort say send() errno:32 Broken pipe
10.218.53.195:27018
Thu Feb 17 23:21:15 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 23:21:15 MessagingPort say send() errno:32 Broken pipe
10.218.53.195:27018
Thu Feb 17 23:21:15 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 23:21:15 MessagingPort say send() errno:32 Broken pipe
10.218.53.195:27018
Thu Feb 17 23:21:15 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 23:21:15 MessagingPort say send() errno:32 Broken pipe
10.218.53.195:27018
Thu Feb 17 23:21:15 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 23:21:48 [conn1239] MessagingPort recv() got 1 bytes wanted
len=4
Thu Feb 17 23:21:56 MessagingPort say send() errno:32 Broken pipe
10.76.57.113:27018
Thu Feb 17 23:21:56 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 23:22:11 MessagingPort say send() errno:32 Broken pipe
10.76.57.113:27018
Thu Feb 17 23:22:11 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 23:22:17 MessagingPort say send() errno:32 Broken pipe
10.76.57.113:27018
Thu Feb 17 23:22:17 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 23:22:40 MessagingPort say send() errno:32 Broken pipe
10.218.53.195:27018
Thu Feb 17 23:22:40 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 23:23:02 MessagingPort say send() errno:32 Broken pipe
10.218.53.195:27018
Thu Feb 17 23:23:02 ERROR: couldn't unsert sharding : socket exception
Thu Feb 17 23:23:09 MessagingPort say send() errno:32 Broken pipe
10.218.53.195:27018
Thu Feb 17 23:23:09 ERROR: couldn't unsert sharding : socket exception

Eliot Horowitz

unread,
Feb 17, 2011, 8:04:16 PM2/17/11
to mongod...@googlegroups.com
It sounds like you've got a connectivity problem.
Are you sure all shards and config servers are up?

On Thu, Feb 17, 2011 at 6:44 PM, Michael Conigliaro

Michael Conigliaro

unread,
Feb 17, 2011, 8:15:01 PM2/17/11
to mongodb-user
Everything is definitely up. In my original message, I posted a mongo
session from when I tried connecting from the mongos/config machine to
one of the shards. I've actually done this test a bunch of times, and
I've never had it fail on me. I guess I should also mention that this
is all running on EC2 within the same security group. I've basically
enabled all communication (tcp/udp ports 1-65535) between all machines
in that group. I assume that if I didn't have this set up properly,
then nothing would be working, because I currently have all my
application servers communicating with my mongos machines (which are
able to access the original shard). The only problem is that it's
really slow. I desperately need to get this data sharded...

- Mike

Eliot Horowitz

unread,
Feb 17, 2011, 8:17:50 PM2/17/11
to mongod...@googlegroups.com
At this point I would recommend trying the 1.7 nightly (soon to be 1.8 series).
There are a lot of improvements, as well as much better diagnostics.


On Thu, Feb 17, 2011 at 8:15 PM, Michael Conigliaro

Michael Conigliaro

unread,
Feb 17, 2011, 8:27:19 PM2/17/11
to mongodb-user
OK, is there any documentation for migrating to 1.7 yet? Any important
things I need to be aware of before attempting this?

Eliot Horowitz

unread,
Feb 17, 2011, 9:41:42 PM2/17/11
to mongod...@googlegroups.com
There are a few minor changes; the most notable is map/reduce.
If you aren't using map/reduce, it should be a drop-in upgrade.
Though I would test doing the upgrade in a sandbox first.
We'll be writing up detailed instructions in the next week or so.

On Thu, Feb 17, 2011 at 8:27 PM, Michael Conigliaro

Michael Conigliaro

unread,
Feb 17, 2011, 10:19:31 PM2/17/11
to mongodb-user
OK, thanks. We're not using map/reduce yet, so that's good. I just
found this documentation, so I'm probably going to take a backup and
give it a shot.

http://www.mongodb.org/display/DOCS/Upgrading+to+1.8.0

Michael Conigliaro

unread,
Feb 17, 2011, 11:25:32 PM2/17/11
to mongodb-user
Just a quick update, I haven't attempted an upgrade yet, but I tried
moving a chunk manually, and it seemed to migrate over fine:

db.runCommand( { moveChunk : "example.userEventsJournal", find : { "userId" : { $minKey : 1 } }, to : "1" } )

I was then able to verify with mongostat that my queries are hitting
the other shard. Does this explain anything? At this point it seems
like I have everything wired up properly, but the balancer just isn't
doing anything.

- Mike
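
One thing worth checking from a mongos shell is whether balancing was ever switched off, and what the balancer lock document currently says. A sketch against the 1.6-era config database (document shapes vary by version, so treat the comments as illustrative):

```javascript
// From the mongo shell, connected to a mongos:
use config
db.settings.find({ _id: "balancer" })  // { stopped: true } would mean balancing is disabled
db.locks.find({ _id: "balancer" })     // process/when/why describe the last balance round
```

These need a running cluster, so there is nothing to verify offline; the point is just to rule out a disabled balancer before digging deeper.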

Eliot Horowitz

unread,
Feb 17, 2011, 11:27:51 PM2/17/11
to mongod...@googlegroups.com
How many mongos do you have?
Can you search the logs for "Balancer"?

On Thu, Feb 17, 2011 at 11:25 PM, Michael Conigliaro

Michael Conigliaro

unread,
Feb 18, 2011, 12:39:45 AM2/18/11
to mongodb-user
3 mongos (one on each config server). Here are the last few hours
worth of logs. I believe a lot of these connection errors are from
when I restarted servers in desperation.

mongodb-config01

Fri Feb 18 03:44:33 [Balancer] SyncClusterConnection connecting to
[mongodb-config03.example.com:27019]
Fri Feb 18 03:44:33 [Balancer] SyncClusterConnection connecting to
[mongodb-config02.example.com:27019]
Fri Feb 18 03:44:33 [Balancer] SyncClusterConnection connecting to
[mongodb-config01.example.com:27019]
Fri Feb 18 03:45:04 [Balancer] SyncClusterConnection connecting to
[mongodb-config03.example.com:27019]
Fri Feb 18 03:45:04 [Balancer] SyncClusterConnection connecting to
[mongodb-config02.example.com:27019]
Fri Feb 18 03:45:04 [Balancer] SyncClusterConnection connecting to
[mongodb-config01.example.com:27019]
Fri Feb 18 03:52:04 [Balancer] dist_lock forcefully taking over from:
{ _id: "balancer", process: "mongodb-config01:1297892094:1804289383",
state: 1, ts: ObjectId('4d5c6cb8263a87e875a75a4c'), when: new
Date(1297902776888), who: "mongodb-
config01:1297892094:1804289383:Balancer:846930886", why: "doing
balance round" } elapsed minutes: 11
Fri Feb 18 03:52:04 [Balancer] ~ScopedDBConnection: _conn != null
Fri Feb 18 03:52:04 [Balancer] ~ScopedDBConnection: _conn != null
Fri Feb 18 03:52:04 [Balancer] caught exception while doing balance:
update not consistent
Fri Feb 18 03:52:35 [Balancer] SyncClusterConnection connecting to
[mongodb-config03.example.com:27019]
Fri Feb 18 03:52:35 [Balancer] SyncClusterConnection connecting to
[mongodb-config02.example.com:27019]
Fri Feb 18 03:52:35 [Balancer] SyncClusterConnection connecting to
[mongodb-config01.example.com:27019]
Fri Feb 18 04:07:43 [Balancer] trying reconnect to
mongodb01.example.com:27018
Fri Feb 18 04:07:52 [Balancer] reconnect mongodb01.example.com:27018
failed couldn't connect to server mongodb01.example.com:27018
Fri Feb 18 04:07:52 [Balancer] MessagingPort say send() errno:9 Bad
file descriptor 10.76.57.113:27018
Fri Feb 18 04:12:40 [Balancer] SyncClusterConnection connecting to
[mongodb-config03.example.com:27019]
Fri Feb 18 04:12:40 [Balancer] SyncClusterConnection connecting to
[mongodb-config02.example.com:27019]
Fri Feb 18 04:12:40 [Balancer] SyncClusterConnection connecting to
[mongodb-config01.example.com:27019]
Fri Feb 18 04:16:56 [Balancer] SyncClusterConnection connecting to
[mongodb-config03.example.com:27019]
Fri Feb 18 04:16:56 [Balancer] SyncClusterConnection connecting to
[mongodb-config02.example.com:27019]
Fri Feb 18 04:16:56 [Balancer] SyncClusterConnection connecting to
[mongodb-config01.example.com:27019]


mongodb-config02

Fri Feb 18 03:45:09 [Balancer] SyncClusterConnection connecting to
[mongodb-config03.example.com:27019]
Fri Feb 18 03:45:09 [Balancer] SyncClusterConnection connecting to
[mongodb-config02.example.com:27019]
Fri Feb 18 03:45:09 [Balancer] SyncClusterConnection connecting to
[mongodb-config01.example.com:27019]
Fri Feb 18 03:45:39 [Balancer] SyncClusterConnection connecting to
[mongodb-config03.example.com:27019]
Fri Feb 18 03:45:39 [Balancer] SyncClusterConnection connecting to
[mongodb-config02.example.com:27019]
Fri Feb 18 03:45:39 [Balancer] SyncClusterConnection connecting to
[mongodb-config01.example.com:27019]
Fri Feb 18 03:52:11 [Balancer] MessagingPort say send() errno:32
Broken pipe 10.112.34.236:27019
Fri Feb 18 03:52:11 [Balancer] ~ScopedDBConnection: _conn != null
Fri Feb 18 03:52:11 [Balancer] ~ScopedDBConnection: _conn != null
Fri Feb 18 03:52:11 [Balancer] caught exception while doing balance:
socket exception
Fri Feb 18 03:52:41 [Balancer] SyncClusterConnection connecting to
[mongodb-config03.example.com:27019]
Fri Feb 18 03:52:41 [Balancer] SyncClusterConnection connecting to
[mongodb-config02.example.com:27019]
Fri Feb 18 03:52:41 [Balancer] SyncClusterConnection connecting to
[mongodb-config01.example.com:27019]
Fri Feb 18 04:07:49 [Balancer] trying reconnect to
mongodb01.example.com:27018
Fri Feb 18 04:07:50 [Balancer] reconnect mongodb01.example.com:27018
failed couldn't connect to server mongodb01.example.com:27018
Fri Feb 18 04:07:50 [Balancer] MessagingPort say send() errno:9 Bad
file descriptor 10.76.57.113:27018
Fri Feb 18 04:15:24 [Balancer] SyncClusterConnection connecting to
[mongodb-config03.example.com:27019]
Fri Feb 18 04:15:24 [Balancer] SyncClusterConnection connecting to
[mongodb-config02.example.com:27019]
Fri Feb 18 04:15:24 [Balancer] SyncClusterConnection connecting to
[mongodb-config01.example.com:27019]
Fri Feb 18 04:15:24 [Balancer] ~ScopedDBConnection: _conn != null
Fri Feb 18 04:15:24 [Balancer] ~ScopedDBConnection: _conn != null
Fri Feb 18 04:15:24 [Balancer] caught exception while doing balance:
update not consistent
Fri Feb 18 04:15:56 [Balancer] SyncClusterConnection connecting to
[mongodb-config03.example.com:27019]
Fri Feb 18 04:15:56 [Balancer] SyncClusterConnection connecting to
[mongodb-config02.example.com:27019]
Fri Feb 18 04:15:56 [Balancer] SyncClusterConnection connecting to
[mongodb-config01.example.com:27019]


mongodb-config03

Fri Feb 18 03:44:58 [Balancer] SyncClusterConnection connecting to
[mongodb-config03.example.com:27019]
Fri Feb 18 03:44:58 [Balancer] SyncClusterConnection connecting to
[mongodb-config02.example.com:27019]
Fri Feb 18 03:44:58 [Balancer] SyncClusterConnection connecting to
[mongodb-config01.example.com:27019]
Fri Feb 18 03:45:28 [Balancer] SyncClusterConnection connecting to
[mongodb-config03.example.com:27019]
Fri Feb 18 03:45:28 [Balancer] SyncClusterConnection connecting to
[mongodb-config02.example.com:27019]
Fri Feb 18 03:45:28 [Balancer] SyncClusterConnection connecting to
[mongodb-config01.example.com:27019]
Fri Feb 18 04:07:17 [Balancer] MessagingPort recv() errno:104
Connection reset by peer 10.76.57.113:27018
Fri Feb 18 04:07:17 [Balancer] SocketException: 9001 socket exception
Fri Feb 18 04:07:17 [Balancer] ~ScopedDBConnection: _conn != null
Fri Feb 18 04:07:17 [Balancer] caught exception while doing balance:
DBClientBase::findOne: transport error
Fri Feb 18 04:07:50 [Balancer] trying reconnect to
mongodb01.example.com:27018
Fri Feb 18 04:07:50 [Balancer] reconnect mongodb01.example.com:27018
failed couldn't connect to server mongodb01.example.com:27018
Fri Feb 18 04:07:50 [Balancer] MessagingPort say send() errno:9 Bad
file descriptor 10.76.57.113:27018
Fri Feb 18 04:07:50 [Balancer] SyncClusterConnection connecting to
[mongodb-config03.example.com:27019]
Fri Feb 18 04:07:50 [Balancer] SyncClusterConnection connecting to
[mongodb-config02.example.com:27019]
Fri Feb 18 04:07:50 [Balancer] SyncClusterConnection connecting to
[mongodb-config01.example.com:27019]
Fri Feb 18 04:15:24 [Balancer] dist_lock forcefully taking over from:
{ _id: "balancer", process: "mongodb-config02:1297844197:1804289383",
state: 1, ts: ObjectId('4d5decea4c1cefb68d523eed'), when: new
Date(1298001130991), who: "mongodb-
config02:1297844197:1804289383:Balancer:846930886", why: "doing
balance round" } elapsed minutes: 307445734561825
Fri Feb 18 04:15:24 [Balancer] ~ScopedDBConnection: _conn != null
Fri Feb 18 04:15:24 [Balancer] ~ScopedDBConnection: _conn != null
Fri Feb 18 04:15:24 [Balancer] caught exception while doing balance:
update not consistent
Fri Feb 18 04:15:56 [Balancer] SyncClusterConnection connecting to
[mongodb-config03.example.com:27019]
Fri Feb 18 04:15:56 [Balancer] SyncClusterConnection connecting to
[mongodb-config02.example.com:27019]
Fri Feb 18 04:15:56 [Balancer] SyncClusterConnection connecting to
[mongodb-config01.example.com:27019]
Fri Feb 18 05:22:14 [Balancer] SyncClusterConnection connecting to
[mongodb-config03.example.com:27019]
Fri Feb 18 05:22:14 [Balancer] SyncClusterConnection connecting to
[mongodb-config02.example.com:27019]
Fri Feb 18 05:22:14 [Balancer] SyncClusterConnection connecting to
[mongodb-config01.example.com:27019]
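
The `elapsed minutes: 307445734561825` in the config03 log above is a red flag on its own: the takeover check is essentially a subtraction of the lock document's `when` timestamp from the current time, so a figure that large means the stored timestamp is garbage or the clocks disagree wildly, not that a lock was merely stale. A simplified sketch of the arithmetic (this is not the actual server code):

```javascript
// Simplified sketch of the dist_lock takeover arithmetic (not the real
// mongos implementation): a lock is force-taken when it appears to have
// been held too long, judged from its "when" timestamp.
function elapsedMinutes(lockWhenMs, nowMs) {
  return Math.floor((nowMs - lockWhenMs) / (60 * 1000));
}

// Sane case: a lock taken 11 minutes earlier, as in the config01 log.
const now = Date.parse("2011-02-18T04:15:24Z");
console.log(elapsedMinutes(now - 11 * 60 * 1000, now)); // 11

// A garbage "when" (uninitialized, or from a wildly skewed clock) explodes
// the figure; 307445734561825 minutes works out to hundreds of millions of
// years, so the stored timestamp itself must be bogus.
console.log(elapsedMinutes(0, now)); // ~21.6 million minutes (about 41 years)
```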

Eliot Horowitz

unread,
Feb 18, 2011, 12:48:49 AM2/18/11
to mongod...@googlegroups.com
Can you verify all config server mongods are healthy?
Something looks amiss there.

Michael Conigliaro

unread,
Feb 18, 2011, 1:11:33 AM2/18/11
to mongodb-user
How do I know if the mongod is healthy? All I see in the mongodb logs
are lots of these:

Fri Feb 18 06:10:54 [conn2619] CMD fsync: sync:1 lock:0
Fri Feb 18 06:10:54 [conn2619] fsync from getlasterror

BTW, I just discovered that the mongo shell on one of my config
servers was taking a really long time to execute any queries (show
dbs, show collections, etc.), so I stopped/started it and that seemed
to help. But now I see tons of these in the mongos.log on a different
server:

Fri Feb 18 06:07:07 [conn224] setShardVersion failed: { version:
Timestamp 3000|1, globalVersion: Timestamp 4000|0, errmsg: "going to
older version for global", ok: 0.0 }
Fri Feb 18 06:07:07 [conn224] Assertion: 10429:setShardVersion failed!
{ "version" : { "t" : 3000 , "i" : 1 }, "globalVersion" : { "t" :
4000 , "i" : 0 }, "errmsg" : "going to older version for global",
"ok" : 0 }
0x509aae 0x60bdf4 0x60b8bc
/usr/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x1de) [0x509aae]
/usr/bin/mongos(_ZN5mongo17checkShardVersionERNS_12DBClientBaseERKSsbi
+0xa34) [0x60bdf4]
/usr/bin/mongos(_ZN5mongo17checkShardVersionERNS_12DBClientBaseERKSsbi
+0x4fc) [0x60b8bc]
Fri Feb 18 06:07:07 [conn224] ~ScopedDBConnection: _conn != null
Fri Feb 18 06:07:07 [conn224] DBException in process: setShardVersion
failed! { "version" : { "t" : 3000 , "i" : 1 }, "globalVersion" :
{ "t" : 4000 , "i" : 0 }, "errmsg" : "going to older version for
global", "ok" : 0 }

Michael Conigliaro

unread,
Feb 18, 2011, 1:13:15 AM2/18/11
to mongodb-user
Scratch that. That error is showing up in all mongos logs now.

Eliot Horowitz

unread,
Feb 18, 2011, 1:27:04 AM2/18/11
to mongod...@googlegroups.com
You should bounce all the mongods.
Can you send a list of all servers and mongod running?
And also check all of them?

Have there been outages of some config servers?


On Fri, Feb 18, 2011 at 1:13 AM, Michael Conigliaro

Michael Conigliaro

unread,
Feb 18, 2011, 1:45:49 AM2/18/11
to mongodb-user
I just bounced everything (config servers, mongos servers, and
application servers), but I'm still seeing those errors in the mongos
logs.

The list of servers is as follows:

mongodb-config01 => config, mongos
mongodb-config02 => config, mongos
mongodb-config03 => config, mongos

I have my application servers configured to connect to these 3
machines. And no, I'm not aware of any outages other than when I've
bounced the services out of desperation.

Eliot Horowitz

unread,
Feb 18, 2011, 1:48:31 AM2/18/11
to mongod...@googlegroups.com
You need to bounce the shard mongods also.
If you tar all up all logs from all servers and attach to a case @
http://jira.mongodb.org/ can take a look.
Otherwise very hard to help debug.

On Fri, Feb 18, 2011 at 1:45 AM, Michael Conigliaro

Michael Conigliaro

unread,
Feb 18, 2011, 1:49:15 AM2/18/11
to mongodb-user
I just noticed this message on two mongos servers:

Fri Feb 18 06:46:34 [Balancer] ~ScopedDBConnection: _conn != null
Fri Feb 18 06:46:34 [Balancer] ~ScopedDBConnection: _conn != null
Fri Feb 18 06:46:34 [Balancer] caught exception while doing balance:
update not consistent

Could this be why the balancer isn't working? Any idea what "update
not consistent" means?

Eliot Horowitz

unread,
Feb 18, 2011, 1:53:28 AM2/18/11
to mongod...@googlegroups.com
Yes that's what I was referring to above.
It generally means a config server is/was down or not reachable.

On Fri, Feb 18, 2011 at 1:49 AM, Michael Conigliaro

Michael Conigliaro

unread,
Feb 18, 2011, 2:22:25 AM2/18/11
to mongodb-user
OK, after bouncing mongods on all my shard machines and then bouncing
all my mongos again, I'm left with just this error in my mongos logs
(seen on two servers so far):

Fri Feb 18 07:10:26 [Balancer] ~ScopedDBConnection: _conn != null
Fri Feb 18 07:10:26 [Balancer] ~ScopedDBConnection: _conn != null
Fri Feb 18 07:10:26 [Balancer] caught exception while doing balance:
update not consistent

I have also verified that all three config servers are up and healthy
by connecting to their mongod processes from another machine. I'm
going to call it a night for now because I'm exhausted. Hopefully when
I check the logs again tomorrow morning, this error will have
magically fixed itself, and the balancer will be doing its thing. I'm
not sure what else to do at this point. If things are still broken,
I'll try to send those logs tomorrow, but more than likely, I'll be
contacting 10gen for commercial support. Thanks for all your help
tonight.

- Mike

Eliot Horowitz

unread,
Feb 18, 2011, 8:31:18 AM2/18/11
to mongod...@googlegroups.com
One thing to do is connect to each config server and run
db.printShardingStatus() and see if there are differences

On Fri, Feb 18, 2011 at 2:22 AM, Michael Conigliaro

Michael Conigliaro

unread,
Feb 18, 2011, 9:40:40 AM2/18/11
to mongodb-user
I just diffed the results, and there are no differences at all. I've
rebooted all three mongos several times, and this is still the only
error I'm seeing now.

- Mike

Eliot Horowitz

unread,
Feb 18, 2011, 9:49:19 AM2/18/11
to mongod...@googlegroups.com
Can you connect to each config server, switch to config db, and run
db.runCommand( "dbhash" )


On Fri, Feb 18, 2011 at 9:40 AM, Michael Conigliaro

Message has been deleted

Michael Conigliaro

unread,
Feb 18, 2011, 9:58:46 AM2/18/11
to mongodb-user
Definitely some differences here (depending on how far apart my
queries are from each other). Looks like locks and lockpings mostly.
The md5 at the end is changing too.

{
    "numCollections" : 12,
    "collections" : {
        "changelog" : "308a6bcba8b90cf730a07151920e63f0",
        "chunks" : "8c979856cfb6fa360e8b3e5a466a687a",
        "collections" : "afe30a5e544ed41b2a4efd150619aa83",
        "databases" : "4b227d0fdbe94e00ea062ee6e2a31500",
        "lockpings" : "d6e45f983c5e04188137c77521d2eb07",
        "locks" : "c11a79e012c44a3c77874bb7354939ce",
        "mongos" : "8a853be5eb557871a2cb2c223fce5cbf",
        "settings" : "ba6c018e3e0a46c900d5c671dcc8de6b",
        "shards" : "d64e7ebe87bfd828fe283727ba88a161",
        "version" : "1b2cfce930005dbc34c253633409dcc1"
    },
    "md5" : "32d6b98598eecf4e027b95283a49c530",
    "ok" : 1
}

{
    "numCollections" : 12,
    "collections" : {
        "changelog" : "308a6bcba8b90cf730a07151920e63f0",
        "chunks" : "8c979856cfb6fa360e8b3e5a466a687a",
        "collections" : "afe30a5e544ed41b2a4efd150619aa83",
        "databases" : "4b227d0fdbe94e00ea062ee6e2a31500",
        "lockpings" : "958833cbbed4b6c1a557212154bed7b1",
        "locks" : "9a19228dcb118720266159c4cda66087",
        "mongos" : "5e054e0780662fb3e59a8446826b5163",
        "settings" : "ba6c018e3e0a46c900d5c671dcc8de6b",
        "shards" : "d64e7ebe87bfd828fe283727ba88a161",
        "version" : "1b2cfce930005dbc34c253633409dcc1"
    },
    "md5" : "f181ea9431b4807127db7cbcf015d6ea",
    "ok" : 1
}

{
    "numCollections" : 12,
    "collections" : {
        "changelog" : "308a6bcba8b90cf730a07151920e63f0",
        "chunks" : "8c979856cfb6fa360e8b3e5a466a687a",
        "collections" : "afe30a5e544ed41b2a4efd150619aa83",
        "databases" : "4b227d0fdbe94e00ea062ee6e2a31500",
        "lockpings" : "866befdfb7ffc6f5f4504d58fc1d22c6",
        "locks" : "9a19228dcb118720266159c4cda66087",
        "mongos" : "44e8664b1bb36f9bea768ae63921b1e4",
        "settings" : "ba6c018e3e0a46c900d5c671dcc8de6b",
        "shards" : "d64e7ebe87bfd828fe283727ba88a161",
        "version" : "1b2cfce930005dbc34c253633409dcc1"
    },
    "md5" : "4be8b5131531ea5a232464d9f26d2d8e",
    "ok" : 1
}
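
The interesting part of those three documents is which hashes disagree, and a tiny script makes the comparison less error-prone than eyeballing. This is a hypothetical helper (not part of MongoDB), assuming each dbhash result has been pasted into an object:

```javascript
// Hypothetical helper: given two dbhash results, list the config
// collections whose hashes disagree.
function diffDbhash(a, b) {
  const names = new Set([...Object.keys(a.collections), ...Object.keys(b.collections)]);
  return [...names].filter(n => a.collections[n] !== b.collections[n]).sort();
}

// Abbreviated example using hashes from the first two results above
// (changelog matches on both, so it drops out):
const server1 = {
  collections: {
    changelog: "308a6bcba8b90cf730a07151920e63f0",
    lockpings: "d6e45f983c5e04188137c77521d2eb07",
    locks: "c11a79e012c44a3c77874bb7354939ce",
  },
};
const server2 = {
  collections: {
    changelog: "308a6bcba8b90cf730a07151920e63f0",
    lockpings: "958833cbbed4b6c1a557212154bed7b1",
    locks: "9a19228dcb118720266159c4cda66087",
  },
};
console.log(diffDbhash(server1, server2)); // [ 'lockpings', 'locks' ]
```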

Eliot Horowitz

unread,
Feb 18, 2011, 10:00:12 AM2/18/11
to mongod...@googlegroups.com
While connected to mongos, can you try dropping the lockpings collection.
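
For anyone following along, the drop is a one-liner from a mongos shell against the config database (sketch; be sure you are connected to a mongos, not directly to a shard, before dropping anything):

```javascript
// From the mongo shell, connected to a mongos:
use config
db.lockpings.drop()
```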


On Fri, Feb 18, 2011 at 9:58 AM, Michael Conigliaro

Michael Conigliaro

unread,
Feb 18, 2011, 10:06:01 AM2/18/11
to mongodb-user
OK, the only difference now is the locks/md5 hashes on one machine.
Should I drop that one too?

Eliot Horowitz

unread,
Feb 18, 2011, 10:15:41 AM2/18/11
to mongod...@googlegroups.com
Yes, but I would restart all the mongo daemons when you do

On Fri, Feb 18, 2011 at 10:06 AM, Michael Conigliaro

Michael Conigliaro

unread,
Feb 18, 2011, 11:24:10 AM2/18/11
to mongodb-user
OK, I'm not seeing that error from the balancer anymore since I
dropped locks and lockpings, but I'm also not seeing any attempt by
the balancer to move things around. Any idea how long this usually
takes?

Eliot Horowitz

unread,
Feb 18, 2011, 11:38:31 AM2/18/11
to mongod...@googlegroups.com
You should start seeing messages from the Balancer on mongos within 30
seconds of starting.
You can start the mongos with -v to see more balancer info

On Fri, Feb 18, 2011 at 11:24 AM, Michael Conigliaro

Michael Conigliaro

unread,
Feb 18, 2011, 12:01:44 PM2/18/11
to mongodb-user
Well, I guess I have my answer, but it is pretty baffling:

Fri Feb 18 16:52:30 [Balancer] dist_lock unlock: { _id: "balancer",
process: "mongodb-config02:1298047859:1804289383", state: 0, ts:
ObjectId('4d5ea3c49829372e57ee04e4'), when: new Date(1298047940874),
who: "mongodb-config02:1298047859:1804289383:Balancer:846930886", why:
"doing balance round" }
Fri Feb 18 16:52:31 [Balancer] dist_lock lock gotLock: 1 now: { _id:
"balancer", process: "mongodb-config02:1298047859:1804289383", state:
1, ts: ObjectId('4d5ea3ce9829372e57ee04e5'), when: new
Date(1298047950966), who: "mongodb-
config02:1298047859:1804289383:Balancer:846930886", why: "doing
balance round" }
Fri Feb 18 16:52:31 [Balancer] *** start balancing round
Fri Feb 18 16:52:31 [Balancer] collection : example.userEventsJournal
Fri Feb 18 16:52:31 [Balancer] donor : 148 chunks on 2
Fri Feb 18 16:52:31 [Balancer] receiver : 148 chunks on 2
Fri Feb 18 16:52:31 [Balancer] collection : example.userfile
Fri Feb 18 16:52:31 [Balancer] donor : 1 chunks on 2
Fri Feb 18 16:52:31 [Balancer] receiver : 1 chunks on 2
Fri Feb 18 16:52:31 [Balancer] collection : example.stats
Fri Feb 18 16:52:31 [Balancer] donor : 60 chunks on 2
Fri Feb 18 16:52:31 [Balancer] receiver : 60 chunks on 2
Fri Feb 18 16:52:31 [Balancer] no need to move any chunk
Fri Feb 18 16:52:31 [Balancer] *** end of balancing round

I have several tens of GB on one shard and no data on the others, and
there is no need to balance? The only explanation I can think of is
that I somehow got my other two shards stuck in draining mode. While
I was moving chunks around manually late last night, I
think I found a bug related to failover. After one of my shards did a
replica set failover, I discovered that db.printShardingStatus(); was
showing incorrect information for one of my chunks. It was telling me
a chunk was somewhere that it wasn't, and I couldn't fix it with the
chunk move commands. I finally tried removing my empty shards out of
desperation, but this didn't work. The fix was ultimately to run an
rs.stepDown to get my shard back to how it was originally. Then I was
able to manually move everything back to the original server and
db.printShardingStatus(); was accurate again. Though like I said, I
have two empty shards stuck in draining mode now. Can I just delete
the data on those shards and re-add them? Or is there something else I
need to do?

- Mike

Eliot Horowitz

unread,
Feb 18, 2011, 12:06:52 PM2/18/11
to mongod...@googlegroups.com
Can you pastebin db.printShardingStatus() ?
We should verify where the data is and make sure its all clean before
doing anything not reversible.

Also, can you send

db.userEventsJournal.stats()

Michael Conigliaro

unread,
Feb 18, 2011, 12:14:02 PM2/18/11
to mongodb-user
https://gist.github.com/834013

Michael Conigliaro

unread,
Feb 18, 2011, 12:17:04 PM2/18/11
to mongodb-user
Looks like a bug here. Check out the values under "remaining."

> db.runCommand({ removeshard : "1/mongodb01.example.com:27018,mongodb02.example.com:27018" })
{
    "msg" : "draining ongoing",
    "state" : "ongoing",
    "remaining" : {
        "chunks" : NumberLong(0),
        "dbs" : NumberLong(1)
    },
    "ok" : 1
}
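The zero/non-zero split under "remaining" is the telling part: draining only finishes when both counters reach zero, and a non-zero "dbs" count means some database still has the drained shard as its primary, which has to be moved (the moveprimary command) before removeshard can complete. A toy sketch of how to read that document (the helper name is mine, not a driver API):

```javascript
// Hypothetical helper: interpret the "remaining" field of a removeshard
// response. Draining finishes only when both counters reach zero; a
// non-zero "dbs" count means a database still has this shard as its
// primary and must be moved (moveprimary) before removal can complete.
function drainingBlockers(remaining) {
  const blockers = [];
  if (remaining.chunks > 0) {
    blockers.push(remaining.chunks + " chunk(s) still migrating off the shard");
  }
  if (remaining.dbs > 0) {
    blockers.push(remaining.dbs + " database(s) still using the shard as primary");
  }
  return blockers;
}
```

For the response above, drainingBlockers({ chunks: 0, dbs: 1 }) points at a database primary as the blocker, which squares with the stray entry in config.databases found later in the thread.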

I was able to remove this shard though:

> db.runCommand({ removeshard : "3/mongodb07.example.com:27018,mongodb08.example.com:27018" })
{
    "msg" : "removeshard completed successfully",
    "state" : "completed",
    "shard" : "3",
> ...

Michael Conigliaro

Feb 18, 2011, 12:25:02 PM
to mongodb-user
Oops, I didn't get db.userEventsJournal.stats() properly. It's fixed
now in my gist.

Eliot Horowitz

Feb 18, 2011, 1:08:22 PM
to mongod...@googlegroups.com
Looks like you tried to access the local db via mongos, which got
things into a bad state.
Can you manually remove the local db from config.databases, then
restart mongos and try removeShard again?
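Concretely, the repair described here amounts to deleting the bogus "local" entry from the config metadata on each config server (something like db.databases.remove({ _id: "local" }) from a shell connected to the config DB; verify against your version before running anything). The intent, sketched as a pure function:

```javascript
// Sketch of the repair's intent: config.databases should never list
// "local" (it is per-node and never sharded), so the fix filters that
// entry out of the metadata. The real edit is done server-side on every
// config server, not client-side like this.
function stripLocalEntry(databases) {
  return databases.filter(function (doc) { return doc._id !== "local"; });
}
```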

On Fri, Feb 18, 2011 at 12:25 PM, Michael Conigliaro

Message has been deleted

Eliot Horowitz

Feb 18, 2011, 2:17:11 PM
to mongod...@googlegroups.com
Are you sure you made the change on all config servers?

On Feb 18, 2011, at 2:10 PM, Michael Conigliaro <mi...@socialmedia.com> wrote:

> OK, that worked, but now I get this:
>
> Fri Feb 18 19:05:23 ERROR: config servers mongodb-config03.example.com:27019 and mongodb-config02.example.com:27019 differ
> config servers mongodb-config03.example.com:27019 and mongodb-config02.example.com:27019 differ
> config servers mongodb-config03.example.com:27019 and mongodb-config02.example.com:27019 differ
> config servers mongodb-config03.example.com:27019 and mongodb-config02.example.com:27019 differ
> config servers not in sync! config servers mongodb-config03.example.com:27019 and mongodb-config02.example.com:27019 differ
> chunks: "f03c5172a5360fbb3c6406805de0f3d8" chunks: "f03c5172a5360fbb3c6406805de0f3d8"
> databases: "680cc37d82404cfaf7a3933e568f9a49" databases: "4b227d0fdbe94e00ea062ee6e2a31500"
> configServer startup check failed
> Fri Feb 18 19:05:23 CursorCache at shutdown - sharded: 0 passthrough: 0
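The error above is mongos's startup consistency check: it hashes key collections (chunks, databases, ...) on every config server and refuses to start if any pair disagrees, which is why a manual metadata edit has to be applied to all config servers. Roughly the check's logic (function name and data shapes are illustrative):

```javascript
// Illustrative version of the mongos startup check: compare per-collection
// hashes across config servers; any mismatch means "config servers not in
// sync" and mongos aborts. Hash values in the test mirror the log output.
function configServersInSync(hashesByServer) {
  const reference = hashesByServer[0];
  return Object.keys(reference).every(function (coll) {
    return hashesByServer.every(function (server) {
      return server[coll] === reference[coll];
    });
  });
}
```

With the chunks hashes equal but the databases hashes differing, as in the log, the check fails and mongos shuts down.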

Michael Conigliaro

Feb 18, 2011, 2:34:03 PM
to mongodb-user
Sorry, I got it working. So I was able to remove all my shards and
re-add them. Now I'm just waiting for the balancer to start doing its
thing...

Eliot Horowitz

Feb 18, 2011, 4:46:00 PM
to mongod...@googlegroups.com
Any progress on balancer?
Any errors in log?

On Fri, Feb 18, 2011 at 2:34 PM, Michael Conigliaro

Michael Conigliaro

Feb 18, 2011, 4:50:06 PM
to mongodb-user
Yeah, the balancer appeared to work for a bit. I actually saw it move
one chunk earlier, but now I'm only seeing messages like this:

Fri Feb 18 21:47:28 [Balancer] *** start balancing round
Fri Feb 18 21:47:28 [Balancer] collection : example.userEventsJournal
Fri Feb 18 21:47:28 [Balancer] donor : 147 chunks on 2
Fri Feb 18 21:47:28 [Balancer] receiver : 0 chunks on 3
Fri Feb 18 21:47:28 [Balancer] chose [2] to [3] { _id: "example.userEventsJournal-userId_"01c8a1f0-c130-5626-d2ef-68de3f8b73da"", lastmod: Timestamp 6000|0, max: { userId: "038a9a57-1bf5-1582-1433-7ef7d1fe4ef7" }, min: { userId: "01c8a1f0-c130-5626-d2ef-68de3f8b73da" }, ns: "example.userEventsJournal", shard: "2" }
Fri Feb 18 21:47:28 [Balancer] collection : example.userfile
Fri Feb 18 21:47:28 [Balancer] donor : 1 chunks on 2
Fri Feb 18 21:47:28 [Balancer] receiver : 0 chunks on 1
Fri Feb 18 21:47:28 [Balancer] moving chunk ns: example.userEventsJournal moving ( ns:example.userEventsJournal at: 2:2/mongodb04.example.com:27018,mongodb05.example.com:27018 lastmod: 6|0 min: { userId: "01c8a1f0-c130-5626-d2ef-68de3f8b73da" } max: { userId: "038a9a57-1bf5-1582-1433-7ef7d1fe4ef7" }) 2:2/mongodb04.example.com:27018,mongodb05.example.com:27018 -> 3:3/mongodb07.example.com:27018,mongodb08.example.com:27018
Fri Feb 18 21:47:29 [Balancer] MOVE FAILED **** _recvChunkStart failed: not master { cause: { errmsg: "not master", ok: 0.0 }, errmsg: "_recvChunkStart failed: not master", ok: 0.0 }
Fri Feb 18 21:47:29 [Balancer] *** end of balancing round

So I can see that something is communicating with a slave, but I'm not
sure what or why. Any tips?

- Mike

Michael Conigliaro

Feb 18, 2011, 5:18:53 PM
to mongodb-user
After failing over all my shards, the balancer looks like it's working
again. I'm not sure why this keeps happening. It's like the mongos
don't know which host is the master, and so they keep trying to
connect to slaves.
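That guess is consistent with how routing works: where writes go is decided from the replica-set members' isMaster replies, and a stale or conflicting view produces exactly the "not master" errors above. A toy sketch of that discovery step (helper name and reply shapes are illustrative, not a driver API):

```javascript
// Illustrative primary discovery from isMaster-style replies. A router
// holding stale data here sends writes to a secondary and gets
// "not master" back, as in the MOVE FAILED log earlier in the thread.
function findPrimary(replies) {
  // Prefer a member that claims to be primary itself.
  for (const host of Object.keys(replies)) {
    if (replies[host].ismaster) return host;
  }
  // Otherwise trust whoever the members say the primary is.
  for (const host of Object.keys(replies)) {
    if (replies[host].primary) return replies[host].primary;
  }
  return null; // election still in progress, or set unreachable
}
```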

Eliot Horowitz

Feb 18, 2011, 8:49:44 PM
to mongod...@googlegroups.com
If you can send the mongod shard logs can see if we can figure out why.

On Fri, Feb 18, 2011 at 5:18 PM, Michael Conigliaro

Michael Conigliaro

Feb 28, 2011, 11:41:29 AM
to mongodb-user
Hey Eliot, thanks for all your help. I believe I've finally solved the
last of my issues. I discovered that a lot of my "not master" errors
were caused by having the driver configured with several mongos
addresses. I didn't realize that specifying multiple addresses would
automatically make the driver act as if it's connecting to a replica
set. I've since installed a mongos on every application server, and
things seem to be working much better now.
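For anyone hitting the same thing, the difference is just in the seed list handed to the driver. Host names below are placeholders; the behavior (a multi-host seed list triggering replica-set handling in the drivers of that era) is as described in the message above.

```javascript
// Placeholder hosts. With a multi-host seed list, the era's drivers ran
// replica-set discovery over the listed mongos processes, which is what
// produced the "not master" confusion. One local mongos per app server
// keeps the seed list to a single host.
const multiMongosSeed = "mongodb://mongos01.example.com:27017,mongos02.example.com:27017"; // problematic
const localMongosSeed = "mongodb://localhost:27017"; // one mongos per app server
```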

- Mike

Jeremy Zawodny

Feb 28, 2011, 6:39:27 PM
to mongod...@googlegroups.com, Michael Conigliaro
Uh. Is that the default for all drivers? I was hoping that sharding would REALLY be transparent to the apps.

Jeremy


Eliot Horowitz

Feb 28, 2011, 6:43:16 PM
to mongod...@googlegroups.com
Not entirely sure what you or Michael mean.
The app just talks to a mongos as normal.

Dwight Merriman

Feb 28, 2011, 7:00:09 PM
to mongod...@googlegroups.com
which driver?


