MongoDB - unstable?


Bo

Oct 12, 2010, 11:00:43 AM
to mongodb-user
Hi all

I'm new to MongoDB but, like many of you, intrigued by the
potential that MongoDB offers.

However, I do have some issues/thoughts about MongoDB I would like to
share with you.

To give you a context - I have 3 VPSs running on 3 different machines
and the setup is an arbiter, a secondary and a primary. The secondary
has a "slaveOk" to allow for reads.

The first problem I encountered was when starting mongod on each
machine (one instance per machine) with:

"mongod --fork --logpath path/to/log --replSet setName --rest"

Using "top" I could see that the process had been spawned all right,
but when I ran "mongo" to open the shell I was told that the shell
could not connect to the db. I gave it some time to connect, but that
was futile too: I could not use the shell at all. To solve this I
installed a newer version of MongoDB on all 3 VPSs, but nothing
changed. By accident I noticed that if I left out --fork I could
connect to the db with mongo, so I started all mongod instances this
way and added the members to the replica set. Only then did I shut
down each mongod process, one at a time, and restart it with --fork.
Somewhat difficult to get up and running!
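
For the connection problem described above, one way to tell "the process exists" apart from "the process accepts connections" is to poll the port instead of watching "top". What follows is only a minimal sketch in Python, not anything from the MongoDB tooling: the function name wait_for_port and its parameters are made up for illustration, and port 27017 is simply the default port that appears in the logs later in this thread.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: it only checks that something is listening,
    not that the server behind the port is healthy.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


if __name__ == "__main__":
    # Assumes a mongod on this machine on the default port.
    print(wait_for_port("127.0.0.1", 27017, timeout_s=120.0))
```

If this keeps returning False long after the forked mongod shows up in "top", the file named by --logpath is probably the next place to look.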

The second and much more serious problem is that the replica set is
very unstable: in just 3 days it has been down 4 times! The message
is "[ReplSetHealthPollTask] replSet info hostname.domain.com is now
down (or slow to respond)". In 2 cases the secondary or the arbiter
was down, in 1 case both the arbiter and the secondary were down, and
today the primary went down with "[rs_sync] replSet syncThread: 10278
dbclient error communicating with server". I'm truly surprised how
unstable MongoDB really is. I'm hoping that I'm doing something wrong
in my setup that could explain this.

tony tam

Oct 12, 2010, 11:34:39 AM
to mongodb-user
I think it's safe to assume that if it were that unstable, so many of
us would not be using MongoDB in production.

Tony

Sergei Tulentsev

Oct 12, 2010, 1:42:40 PM
to mongod...@googlegroups.com
Well, Tony, that might just as well mean that we are all reckless guys who like to play with new toys :-)

--
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com.
To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.




--
Best regards,
Sergei Tulentsev

Kyle Banker

Oct 12, 2010, 2:18:38 PM
to mongod...@googlegroups.com
Bo,

We haven't seen the sort of instability you're describing. It seems like the individual nodes are going down due to networking issues. Is that your suspicion? We'd like to help. Can you provide:
- MongoDB version
- ReplSet config object
- Relevant log files

Kyle

Markus Gattol

Oct 12, 2010, 2:50:53 PM
to mongodb-user
What kind of virtualization do you use? OpenVZ/LXC/Virtuozzo? There
have been reports of memory management issues with those. However, in
addition to answering that, please send a tail of your logs.

Bo

Oct 12, 2010, 3:27:29 PM
to mongodb-user
Yes, I am aware of that, and that is why I'm sure there is a solution to
this, but the fact is that I'm experiencing these problems.

Bo

Oct 12, 2010, 4:10:27 PM
to mongodb-user
Kyle, thanks for helping me out :)

See below, I have copied what I could find

On Oct 12, 8:18 pm, Kyle Banker <k...@10gen.com> wrote:
> Bo,
>
> We haven't seen the sort of instability you're describing. It seems like the
> individual nodes are going down due to networking issues. Is that your
> suspicion? We'd like to help. Can you provide
> - MongoDB version

db version v1.6.3, pdfile version 4.5

> - ReplSet config object

query local.system.replset

{ "_id" : "setname",
  "version" : 1,
  "members" : [
    { "_id" : 0,
      "host" : "hostname1.somedomain.com" },
    { "_id" : 1,
      "host" : "hostname2.somedomain.com",
      "arbiterOnly" : true },
    { "_id" : 2,
      "host" : "hostname3.somedomain.com" } ] }

> - Relevant log files

Log from Primary (showing the last line, which suggests a sudden
interruption):

Tue Oct 12 01:49:18 [conn8] getmore local.oplog.rs cid:
8832731092773422124 getMore: { ts: { $gte: new
Date(5526531326833852417) } } bytes:20 nreturned:0 10387ms


Log from Secondary:

Mon Oct 11 12:21:33 MongoDB starting : pid=8843 port=27017 dbpath=/
data/db/ 64-bit
Mon Oct 11 12:21:33 db version v1.6.3, pdfile version 4.5
Mon Oct 11 12:21:33 git version:
278bd2ac2f2efbee556f32c13c1b6803224d1c01
Mon Oct 11 12:21:33 sys info: Linux domU-12-31-39-06-79-A1
2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64
BOOST_LIB_VERSION=1_41
Mon Oct 11 12:21:36 [initandlisten] waiting for connections on port
27017
Mon Oct 11 12:21:36 [startReplSets] replSet can't get
local.system.replset config from self or any seed (yet)
Mon Oct 11 12:21:39 [websvr] web admin interface listening on port
28017
Mon Oct 11 12:21:39 [initandlisten] connection accepted from
so.me.ip.number1:54564 #1
Mon Oct 11 12:21:40 [initandlisten] connection accepted from
so.me.ip.number2:51843 #2
Mon Oct 11 12:21:46 [initandlisten] connection accepted from
so.me.ip.number3:41076 #3
Mon Oct 11 12:21:46 [startReplSets] replSet STARTUP2
Mon Oct 11 12:21:46 [rs Manager] replSet can't see a majority, will
not try to elect self
Mon Oct 11 12:21:48 [ReplSetHealthPollTask] replSet info
hostname1.somedomain.com is now up
Mon Oct 11 12:21:48 [ReplSetHealthPollTask] replSet
hostname1.somedomain.com ARBITER
Mon Oct 11 12:21:48 [ReplSetHealthPollTask] replSet info
hostname3.somedomain.com is now up
Mon Oct 11 12:21:48 [ReplSetHealthPollTask] replSet
hostname3.somedomain.com PRIMARY
Mon Oct 11 12:21:48 [rs Manager] replSet info electSelf 2
Mon Oct 11 12:21:48 [rs Manager] replSet PRIMARY
Mon Oct 11 12:21:48 [rs_sync] replSet SECONDARY
Mon Oct 11 12:36:47 [conn2] end connection so.me.ip.number2:51843
Mon Oct 11 12:36:47 [rs_sync] replSet syncThread: 10278 dbclient error
communicating with server
Mon Oct 11 12:36:48 [ReplSetHealthPollTask] replSet info
hostname3.somedomain.com is now down (or slow to respond)
Mon Oct 11 12:36:48 [rs Manager] replSet info electSelf 2
Mon Oct 11 12:36:48 [rs Manager] replSet PRIMARY
Mon Oct 11 12:39:09 [ReplSetHealthPollTask] replSet info
hostname3.somedomain.com is now up
Mon Oct 11 12:39:09 [ReplSetHealthPollTask] replSet
hostname3.somedomain.com STARTUP2
Mon Oct 11 12:39:09 [initandlisten] connection accepted from
so.me.ip.number2:48926 #4
Mon Oct 11 12:39:11 [ReplSetHealthPollTask] replSet
hostname3.somedomain.com RECOVERING
Mon Oct 11 12:39:11 [initandlisten] connection accepted from
so.me.ip.number2:48943 #5
Mon Oct 11 12:39:11 [conn5] query local.oplog.rs ntoreturn:1 reslen:
115 nscanned:1 {} nreturned:1 150ms
Mon Oct 11 12:39:13 [conn5] query local.oplog.rs reslen:115 nscanned:1
{ ts: { $gte: new Date(5526531326833852417) } } nreturned:1 2285ms
Mon Oct 11 12:39:15 [ReplSetHealthPollTask] replSet
hostname3.somedomain.com SECONDARY
Mon Oct 11 12:39:15 [slaveTracking] building new index on { _id: 1 }
for local.slaves
Mon Oct 11 12:39:15 [slaveTracking] done for 0 records 0.132secs
Mon Oct 11 12:39:15 [slaveTracking] update local.slaves query: { _id:
ObjectId('4cb367df19f53a593c506823'), host: "so.me.ip.number2", ns:
"local.oplog.rs" } 506ms
Mon Oct 11 12:39:22 [conn5] getmore local.oplog.rs cid:
5696201091655256647 getMore: { ts: { $gte: new
Date(5526531326833852417) } } bytes:20 nreturned:0 9030ms
Mon Oct 11 12:39:29 [conn5] getmore local.oplog.rs cid:
5696201091655256647 getMore: { ts: { $gte: new
Date(5526531326833852417) } } bytes:20 nreturned:0 6315ms
Mon Oct 11 12:39:35 [conn5] getmore local.oplog.rs cid:
5696201091655256647 getMore: { ts: { $gte: new
Date(5526531326833852417) } } bytes:20 nreturned:0 6209ms
Mon Oct 11 12:39:41 [conn5] getmore local.oplog.rs cid:
5696201091655256647 getMore: { ts: { $gte: new
Date(5526531326833852417) } } bytes:20 nreturned:0 6307ms
Mon Oct 11 12:39:47 [conn5] getmore local.oplog.rs cid:
5696201091655256647 getMore: { ts: { $gte: new
Date(5526531326833852417) } } bytes:20 nreturned:0 5854ms
Mon Oct 11 12:39:49 [initandlisten] connection accepted from
so.me.ip.number3:40229 #6
Mon Oct 11 12:39:54 [conn5] getmore local.oplog.rs cid:
5696201091655256647 getMore: { ts: { $gte: new
Date(5526531326833852417) } } bytes:20 nreturned:0 6526ms
Mon Oct 11 12:40:00 [conn5] getmore local.oplog.rs cid:
5696201091655256647 getMore: { ts: { $gte: new
Date(5526531326833852417) } } bytes:20 nreturned:0 6655ms
Mon Oct 11 12:40:07 [conn5] getmore local.oplog.rs cid:
5696201091655256647 getMore: { ts: { $gte: new
Date(5526531326833852417) } } bytes:20 nreturned:0 6463ms
Mon Oct 11 12:40:13 [conn5] getmore local.oplog.rs cid:
5696201091655256647 getMore: { ts: { $gte: new
Date(5526531326833852417) } } bytes:20 nreturned:0 6587ms
Mon Oct 11 12:40:15 [conn6] replSet RECOVERING
Mon Oct 11 12:40:15 [conn6] replSet info stepped down as primary
Mon Oct 11 12:40:16 [conn4] replSet info voting yea for 0
Mon Oct 11 12:40:17 [ReplSetHealthPollTask] replSet
hostname3.somedomain.com PRIMARY
Mon Oct 11 12:40:17 [rs_sync] replSet SECONDARY
Mon Oct 11 12:40:19 [conn5] getmore local.oplog.rs cid:
5696201091655256647 getMore: { ts: { $gte: new
Date(5526531326833852417) } } bytes:20 nreturned:0 5834ms
Mon Oct 11 12:40:19 [conn5] end connection so.me.ip.number2:48943
Mon Oct 11 15:04:58 [conn6] end connection so.me.ip.number3:40229
Mon Oct 11 15:54:26 [initandlisten] connection accepted from
so.me.ip.number1:54089 #7
Mon Oct 11 15:54:31 [conn1] end connection so.me.ip.number1:54564
Mon Oct 11 15:54:34 [ReplSetHealthPollTask] replSet info
hostname1.somedomain.com is now down (or slow to respond)
Mon Oct 11 15:54:36 [ReplSetHealthPollTask] replSet info
hostname1.somedomain.com is now up
Tue Oct 12 02:13:32 [ReplSetHealthPollTask] MessagingPort recv()
remote dead so.me.ip.number2:27017
Tue Oct 12 02:13:32 [ReplSetHealthPollTask] SocketException: 9001
socket exception
Tue Oct 12 02:13:32 [ReplSetHealthPollTask] replSet info
hostname3.somedomain.com is now down (or slow to respond)
Tue Oct 12 02:13:32 [rs Manager] replSet info electSelf 2
Tue Oct 12 02:13:32 [rs Manager] replSet PRIMARY
Tue Oct 12 03:49:18 [rs_sync] MessagingPort recv() errno:104
Connection reset by peer so.me.ip.number2:27017
Tue Oct 12 03:49:18 [rs_sync] SocketException: 9001 socket exception
Tue Oct 12 03:49:18 [rs_sync] MessagingPort flush send() errno:32
Broken pipe so.me.ip.number2:27017
Tue Oct 12 03:49:18 [rs_sync] caught exception (socket exception) in
destructor (~PiggyBackData)
Tue Oct 12 03:49:18 [rs_sync] replSet syncThread: 10278 dbclient error
communicating with server
Tue Oct 12 03:49:20 [conn4] end connection so.me.ip.number2:48926
Tue Oct 12 11:14:42 [initandlisten] connection accepted from
so.me.ip.number1:36864 #8
Tue Oct 12 11:14:43 [conn8] end connection so.me.ip.number1:36864
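
To see how long each "is now down" episode in a log like the one above lasted, the down/up messages can be paired up mechanically. The following is a rough Python sketch under a few assumptions: the log uses the timestamp format shown above, the year (which the log does not print) is passed in by hand, and long entries may have been wrapped onto a second line by the mail client. The function names unwrap and outages are made up for this illustration.

```python
import re
from datetime import datetime

# Entries in the format shown above start with e.g. "Mon Oct 11 12:36:48 ".
STAMP = re.compile(r"^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) \w{3} [ \d]\d \d\d:\d\d:\d\d ")
EVENT = re.compile(r"replSet info (\S+) is now (up|down)")


def unwrap(lines):
    """Re-join entries that were wrapped onto continuation lines."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if STAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()
    return entries


def outages(lines, year=2010):
    """Return (host, went_down, came_back) tuples; came_back is None if still down."""
    open_since = {}
    result = []
    for entry in unwrap(lines):
        match = EVENT.search(entry)
        if not match:
            continue
        host, state = match.groups()
        # Characters 4..18 hold "Oct 11 12:36:48"; the year is an assumption.
        when = datetime.strptime(f"{year} {entry[4:19]}", "%Y %b %d %H:%M:%S")
        if state == "down":
            open_since.setdefault(host, when)
        elif host in open_since:
            result.append((host, open_since.pop(host), when))
    result.extend((host, since, None) for host, since in open_since.items())
    return result
```

Fed the secondary's log above, this pairs the 12:36:48 "down" for hostname3 with the 12:39:09 "up", and reports the 02:13:32 "down" on Tue Oct 12 as still open because no matching "up" follows in the excerpt.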


Bo

Oct 12, 2010, 4:15:00 PM
to mongodb-user
Hi Markus - also a big thanks to you for helping...

I just chatted with my hosting center and all they could say was
"vserver". As far as I can see, they use Linux-VServer to provide
their VPSs; please have a look at http://linux-vserver.org/Welcome_to_Linux-VServer.org

Markus Gattol

Oct 12, 2010, 5:00:07 PM
to mongodb-user
ok, you're having connection timeouts; maybe because you're using
hostnames rather than IPs; can you try with IPs please

other than that, Linux-VServer belongs in the same category, namely
operating system virtualization rather than e.g. paravirtualization,
but I really think in this case using hostnames is the issue. try with
IPs, see if it goes away
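
Before rewriting the config, a quick way to check whether name resolution is slow or flaky is to time it from each machine. This is only a sketch in Python using the standard resolver call; the function name time_resolution is made up here, and the hostnames in the usage line are the anonymized ones from the config earlier in the thread.

```python
import socket
import time


def time_resolution(hosts, port=27017):
    """Map each host to (seconds_taken, sorted list of resolved addresses).

    An empty address list means the lookup failed.
    """
    results = {}
    for host in hosts:
        start = time.monotonic()
        try:
            infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address string.
            addresses = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            addresses = []
        results[host] = (time.monotonic() - start, addresses)
    return results


if __name__ == "__main__":
    for host, (seconds, addresses) in time_resolution(
        ["hostname1.somedomain.com", "hostname3.somedomain.com"]
    ).items():
        print(f"{host}: {seconds:.3f}s -> {addresses or 'lookup failed'}")
```

Lookups that take seconds, or fail now and then, would point at DNS; consistently fast lookups would suggest looking elsewhere.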

Bo

Oct 13, 2010, 4:28:05 PM
to mongodb-user
Hi Markus

I tried your suggestion and the replSet is now set with IPs:

query local.system.replset

{ "_id" : "setname",
  "version" : 3,
  "members" : [
    { "_id" : 0,
      "host" : "xxx.xxx.xxx.xx" },
    { "_id" : 1,
      "host" : "xxx.xxx.xxx.xx" },
    { "_id" : 2,
      "host" : "xx.xxx.xxx.xx",
      "arbiterOnly" : true } ] }

... but 41 minutes ago I received a "connect/transport error" and
[ReplSetHealthPollTask] replSet info xxx.xxx.xxx.xx is now down (or
slow to respond). It was the primary that failed.
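
For a three-member set like the one above, whether a new primary can be elected after such a failure comes down to simple arithmetic: the members that can still see each other must form a strict majority of the set. The Python sketch below illustrates only that arithmetic. It assumes every member, arbiter included, carries the default single vote, and the function name can_elect and the host names a, b and c are placeholders, not part of MongoDB.

```python
def can_elect(config, reachable_hosts):
    """True if the reachable members form a strict majority of the set.

    Assumes one vote per member (the default); reachable_hosts should
    include the member doing the counting.
    """
    members = config["members"]
    visible = sum(1 for member in members if member["host"] in reachable_hosts)
    return visible > len(members) // 2


# Placeholder hosts standing in for the masked addresses above.
config = {
    "_id": "setname",
    "members": [
        {"_id": 0, "host": "a"},
        {"_id": 1, "host": "b"},
        {"_id": 2, "host": "c", "arbiterOnly": True},
    ],
}

print(can_elect(config, {"b", "c"}))  # primary "a" unreachable: 2 of 3 remain
print(can_elect(config, {"b"}))       # only one member left: no majority
```

So losing the primary alone should still leave the secondary and the arbiter able to elect, while a member that can see nobody else will log "can't see a majority" as in the startup lines of the secondary's log.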

Kyle Banker

Oct 18, 2010, 3:39:47 PM
to mongod...@googlegroups.com
How is it that the primary failed? Did it crash? If not, you most
likely have a networking issue, as Markus suggests.

Raoul

Oct 18, 2010, 6:01:30 PM
to mongodb-user

We've experienced this same behavior when starting a replica set. The
solution was just to wait awhile before trying to connect, in our case
about 10 or 15 minutes. The replica set allocates a lot of files
initially, though I'm not sure that was the reason for requiring the
delay.

Markus Gattol

Oct 19, 2010, 4:26:39 AM
to mongodb-user


On Oct 18, 11:01 pm, Raoul <convenient.acco...@gmail.com> wrote:
> We've experienced this same behavior when starting a replica set. The
> solution was just to wait awhile before trying to connect, in our case
> about 10 or 15 minutes.

That's not really an acceptable solution. Did you have a look at
what's flying back and forth on the wire using tcpdump for example?
Now that we know DNS isn't the culprit, looking at what's going on
on the network is the next logical step.

> The replica set allocates a lot of files
> initially, though I'm not sure that was the reason for requiring the
> delay.

You should use a modern filesystem in order to have instant file
allocations
http://www.markus-gattol.name/ws/mongodb.html#filesystem

Kristina Chodorow

Oct 19, 2010, 8:07:29 AM
to mongod...@googlegroups.com
Also, you shouldn't have to wait 15 minutes, but unless you're using the 1.7 master you'll have to wait a few minutes for the replica set to realize it isn't configured and accept a config (http://jira.mongodb.org/browse/SERVER-1847).


Alvin Richards

Oct 19, 2010, 2:05:26 PM
to mongodb-user
As Markus suggests, most modern filesystems (XFS, ext4) deal with the
allocation well. However, the default filesystem may not be one of
these; for example, on EC2 the instance storage is ext3, on which
mongod will take 10-15 minutes to start as it preallocates files.

Check your default using the mount command, e.g.

ubuntu@domU-12-31-39-0B-1C-76:~$ mount
/dev/sda1 on / type ext3 (rw)
/dev/sda2 on /mnt type ext3 (rw)

... as you can see, in this example they are ext3.

-Alvin
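
With several mounts it is not always obvious which one the data directory lives on. Here is a small Python sketch that picks the longest matching mount point for a path out of "mount" output in the format shown above. The function name filesystem_for is made up for illustration, and /data/db/ is simply the dbpath that appears in the log earlier in the thread.

```python
import re

# Matches lines such as "/dev/sda1 on / type ext3 (rw)".
MOUNT_LINE = re.compile(r"^(\S+) on (\S+) type (\S+)")


def filesystem_for(path, mount_output):
    """Return (mount_point, fs_type) for the mount that contains path, or None."""
    wanted = path.rstrip("/") + "/"
    best = None
    for line in mount_output.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if not match:
            continue
        _device, mount_point, fs_type = match.groups()
        prefix = mount_point.rstrip("/") + "/"
        # Keep the longest mount point that is a prefix of the path.
        if wanted.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best


sample = "/dev/sda1 on / type ext3 (rw)\n/dev/sda2 on /mnt type ext3 (rw)"
print(filesystem_for("/data/db/", sample))
```

On a live machine the second argument could come from subprocess.run(["mount"], capture_output=True, text=True).stdout; mount points containing spaces are not handled by this sketch.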

