process crashes and node went down

Niels Boldt

Nov 20, 2013, 7:49:43 AM
to couc...@googlegroups.com
Hi,

I experienced a process crash on a node that caused the entire node to go down.

The log in the console shows:

Port server ns_server on node 'babysitte...@127.0.0.1' exited with status 134. Restarting. Messages: Apache CouchDB 1.2.0a-386be73-git (LogLevel=info) is starting.
Apache CouchDB has started. Time to relax.
working as port
/opt/couchbase/lib/erlang/lib/os_mon-2.2.7/priv/bin/memsup: Erlang has closed.
Erlang has closed

Crash dump was written to: erl_crash.dump.1384711092.3044
eheap_alloc: Cannot allocate 1459620480 bytes of memory (of type "heap"). ns_log000 ns...@production.couchbase.node.8 12:39:37 - Wed Nov 20, 2013
Node 'ns...@production.couchbase.node.8' synchronized otp cookie nhntjjqiovvnjcvf from cluster ns_cookie_manager002 ns...@production.couchbase.node.8 12:39:37 - Wed Nov 20, 2013
Port server moxi on node 'babysitte...@127.0.0.1' exited with status 0. Restarting. Messages: WARNING: curl error: couldn't connect to host from: http://127.0.0.1:8091/pools/default/saslBucketsStreaming
ERROR: could not contact REST server(s): http://127.0.0.1:8091/pools/default/saslBucketsStreaming
WARNING: curl error: couldn't connect to host from: http://127.0.0.1:8091/pools/default/saslBucketsStreaming
ERROR: could not contact REST server(s): http://127.0.0.1:8091/pools/default/saslBucketsStreaming
WARNING: curl error: couldn't connect to host from: http://127.0.0.1:8091/pools/default/saslBucketsStreaming
ERROR: could not contact REST server(s): http://127.0.0.1:8091/pools/default/saslBucketsStreaming
WARNING: curl error: couldn't connect to host from: http://127.0.0.1:8091/pools/default/saslBucketsStreaming
ERROR: could not contact REST server(s): http://127.0.0.1:8091/pools/default/saslBucketsStreaming
WARNING: curl error: couldn't connect to host from: http://127.0.0.1:8091/pools/default/saslBucketsStreaming
ERROR: could not contact REST server(s): http://127.0.0.1:8091/pools/def

After the crash, the node was unresponsive, and I had to kill and restart the server on the node to make it rejoin the cluster, which it did immediately. The server probably became unresponsive because a Couchbase process was using all available CPU.


I'm running a cluster of 3 nodes on AWS, on m1.large instances with the following specs:

General purpose, m1.large, 64-bit, 2 vCPU, 4 ECU, 7.5 GB memory, 2 x 420 GB storage

I'm running a single replication to another cluster.

I have allocated 17.6 GB in total for the cluster, of which 10.1 GB is used. Altogether the machines have 22.5 GB of memory.
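
A rough back-of-the-envelope check of these numbers, as a sketch only. It assumes the cluster quota is split evenly across the 3 nodes and that the Erlang process that failed allocates from memory outside that quota; neither is confirmed in this thread. The failed allocation size is taken from the `eheap_alloc` line in the log above.

```python
# Figures from the post; the even split across nodes is an assumption.
NODES = 3
RAM_PER_NODE_GB = 7.5             # m1.large
CLUSTER_QUOTA_GB = 17.6           # total allocated to Couchbase
FAILED_ALLOC_BYTES = 1459620480   # from the eheap_alloc line in the log


def headroom_per_node_gb(nodes, ram_per_node_gb, cluster_quota_gb):
    """RAM left on each node outside the quota, assuming an even split."""
    return ram_per_node_gb - cluster_quota_gb / nodes


headroom = headroom_per_node_gb(NODES, RAM_PER_NODE_GB, CLUSTER_QUOTA_GB)
failed_gib = FAILED_ALLOC_BYTES / 2**30

print(f"total RAM:         {NODES * RAM_PER_NODE_GB:.1f} GB")
print(f"headroom per node: {headroom:.2f} GB")
print(f"failed allocation: {failed_gib:.2f} GiB")
```

Under those assumptions, each node has roughly 1.63 GB outside the quota for the operating system and everything else, while the single heap allocation that failed was about 1.36 GiB. If that holds, one such request would consume most of the remaining headroom.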

I suspect the crash might be related to https://www.couchbase.com/issues/browse/MB-9097, so I'm trying to stop the replication.

Does anybody have a suggestion for a possible workaround, or should I file an issue?

I'm currently not allowed to post questions in the Couchbase community forums; I just get a message saying "access denied". Has anyone else experienced this?

Thanks
Niels






--
BinaryConstructors ApS
Vestergade 10a, 4th
1456 Kbh K
Denmark
phone: +4529722259
web: http://www.binaryconstructors.dk
mail: n...@binaryconstructors.dk
skype: nielsboldt

Chad Kouse

Nov 20, 2013, 12:22:31 PM
to couc...@googlegroups.com, couc...@googlegroups.com
When you killed the server process, are you sure it wasn't just in the warmup stage?
--chad


Aliaksey Kandratsenka

Nov 20, 2013, 3:02:15 PM
to couc...@googlegroups.com
Possibly. If you are indeed running XDCR, we recently fixed two memory issues in that code base: MB-9209.

Niels Boldt

Nov 20, 2013, 4:27:53 PM
to couc...@googlegroups.com
> When you killed the server process are you sure it wasn't just in the warmup stage?
> --chad

I'm pretty sure it wasn't. The node was unresponsive for more than 15 minutes and was marked as down in the console. Also, when I tried to access the console on the node through a browser, it did not respond.

After the restart, it joined the cluster and was up and running in less than a minute.

Thanks
Niels

Niels Boldt

Nov 20, 2013, 4:46:09 PM
to couc...@googlegroups.com
Will that fix first be available in 2.5, or will it be backported to older versions?

/Niels


 

Aliaksey Kandratsenka

Nov 20, 2013, 4:57:52 PM
to couc...@googlegroups.com

2.5, with a strong possibility of a hot fix for enterprise customers.

>
> /Niels
