Cluster freezes with "Temporary failure" appearing too often

Thomas

Jan 3, 2014, 7:52:50 AM
to couc...@googlegroups.com
Hi,

I have been running some tests against Couchbase Server 2.1.1 Community Edition and it seems that I have an issue with data insertion. I have been following the instructions from the following URL: http://docs.couchbase.com/couchbase-sdk-java-1.2/#advanced-usage

The problem is that, for very long periods (e.g. more than two hours), Couchbase seems to freeze and does not accept new documents via the Java client (set(...)), which makes data insertion problematic.
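
For reference, the insertion code boils down to a loop like the sketch below (node address, bucket and document contents are placeholders, not my real ones):

import com.couchbase.client.CouchbaseClient;
import net.spy.memcached.internal.OperationFuture;
import java.net.URI;
import java.util.Arrays;
import java.util.List;

public class InsertTest {
    public static void main(String[] args) throws Exception {
        // Placeholder node address and bucket; the real cluster has three nodes.
        List<URI> nodes = Arrays.asList(URI.create("http://node1:8091/pools"));
        CouchbaseClient client = new CouchbaseClient(nodes, "default", "");

        for (int i = 0; i < 1000000; i++) {
            OperationFuture<Boolean> f = client.set("doc:" + i, 0, "{\"value\":" + i + "}");
            if (!f.get()) {
                // During the freeze this prints "Temporary failure" over and over.
                System.err.println("set failed for doc:" + i + ": " + f.getStatus().getMessage());
            }
        }
        client.shutdown();
    }
}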

Any ideas of why this is happening?

My cluster has three nodes with 42 GB of memory in total, and I currently have about 100 million documents in one bucket (default).

Let me know if you need any additional information from my cluster.

Thanks

Thomas

Jan 3, 2014, 9:55:29 AM
to couc...@googlegroups.com
Hi,

I have tried to add some documents with the cbdocloader tool but again no luck; I get the following:

couchbase.couchbaseclient.MemcachedError: Memcached error #134:  Temporary failure

Any ideas?

Chad Kouse

Jan 3, 2014, 10:03:06 AM
to couc...@googlegroups.com
When this happens it would be interesting to check whether the memcached process has been restarted on any of the nodes. It sort of sounds like it is crashing, and with the size of your bucket you would incur a long warm-up time.

How many replicas are you running?
--chad



Thomas

Jan 3, 2014, 10:09:09 AM
to couc...@googlegroups.com
How can I check if the memcached process has been restarted? I run my nodes on RPM-based Linux machines.

I have one replica configured



Chad Kouse

Jan 3, 2014, 10:15:25 AM
to couc...@googlegroups.com
With 3 nodes and only 1 replica, there shouldn't be a situation where only 1 process dies and all sets fail. During this time, do some sets actually work? (Note that the same key will keep failing over and over, but another key may map to another vBucket which is still alive.)
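
One quick way to check, reusing the CouchbaseClient you already have open (the key names here are just placeholders), is something like this:

// Probe a handful of unrelated keys (this can sit inside the same method as
// the insert loop). Different keys map to different vBuckets, so if only one
// node's memcached is down, some of these sets should still succeed.
for (int i = 0; i < 20; i++) {
    String key = "probe:" + System.currentTimeMillis() + ":" + i;
    net.spy.memcached.ops.OperationStatus st = client.set(key, 60, "x").getStatus();
    System.out.println(key + " -> " + (st.isSuccess() ? "OK" : st.getMessage()));
}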

It's unlikely but not impossible that more than one node is crashing at the same time. 

To check if the memcached process restarted you can look in the process list and either compare the pid or do something like: http://linuxcommando.blogspot.com/2008/09/how-to-get-process-start-date-and-time.html?m=1

The process should have started at the same time as the other Couchbase services (such as beam.smp) - or just look at ps -ef | grep couchbase.


--chad



Thomas

Jan 3, 2014, 10:24:49 AM
to couc...@googlegroups.com
Thank you Chad, I will check it and let you know. Is there any log file to see why it is crashing? Can I disable the warmup period, or is it a standard part of the process?

Thanks again

Thomas



Chad Kouse

Jan 3, 2014, 10:29:17 AM
to couc...@googlegroups.com
There should be some logging - most likely it's not just crashing but being killed by something else, like the OOM killer. (This of course shouldn't be happening either, but I've seen reports of it happening.)

Can't disable the warmup period.

I don't have a lot of experience with Couchbase 2.x, but I think the only way to make it warm up faster is to use faster drives (SSDs) or more nodes (and thus less data to warm up on any given node).
--chad



Thomas

Jan 3, 2014, 10:42:43 AM
to couc...@googlegroups.com
I have checked all three nodes, but the processes seem to have been running since I initialized the cluster today.

Any other ideas? Is there a way to enable more detailed logging and see where exactly it is failing? Is there a known bug in 2.1.1 Community Edition?

Thanks



$> ps -eAf | grep couchbase
218       3339     1  0 08:00 ?        00:00:00 /opt/couchbase/lib/erlang/erts-5.8.5/bin/epmd -daemon
218       3359     1  0 08:01 ?        00:00:10 /opt/couchbase/lib/erlang/erts-5.8.5/bin/beam.smp -A 16 -- -root /opt/couchbase/lib/erlang -progname erl -- -home /opt/couchbase -- -smp enable -kernel inet_dist_listen_min 21100 inet_dist_listen_max 21299 error_logger false -sasl sasl_error_logger false -hidden -name babysitte...@127.0.0.1 -setcookie nocookie -noshell -noinput -noshell -noinput -run ns_babysitter_bootstrap -- -couch_ini /opt/couchbase/etc/couchdb/default.ini /opt/couchbase/etc/couchdb/default.d/capi.ini /opt/couchbase/etc/couchdb/default.d/geocouch.ini /opt/couchbase/etc/couchdb/local.ini -ns_babysitter cookiefile "/opt/couchbase/var/lib/couchbase/couchbase-server.cookie" -ns_server config_path "/opt/couchbase/etc/couchbase/static_config" -ns_server pidfile "/opt/couchbase/var/lib/couchbase/couchbase-server.pid" -ns_server cookiefile "/opt/couchbase/var/lib/couchbase/couchbase-server.cookie-ns-server" -ns_server enable_mlockall true
218       3393  3359  7 08:01 ?        00:34:34 /opt/couchbase/lib/erlang/erts-5.8.5/bin/beam.smp -A 16 -sbt u -P 327680 -K true -MMmcs 30 -- -root /opt/couchbase/lib/erlang -progname erl -- -home /opt/couchbase -- -smp enable -setcookie nocookie -kernel inet_dist_listen_min 21100 inet_dist_listen_max 21299 error_logger false -sasl sasl_error_logger false -nouser -run child_erlang child_start ns_bootstrap -- -smp enable -kernel inet_dist_listen_min 21100 inet_dist_listen_max 21299 error_logger false -sasl sasl_error_logger false -couch_ini /opt/couchbase/etc/couchdb/default.ini /opt/couchbase/etc/couchdb/default.d/capi.ini /opt/couchbase/etc/couchdb/default.d/geocouch.ini /opt/couchbase/etc/couchdb/local.ini -ns_babysitter cookiefile "/opt/couchbase/var/lib/couchbase/couchbase-server.cookie" -ns_server config_path "/opt/couchbase/etc/couchbase/static_config" -ns_server pidfile "/opt/couchbase/var/lib/couchbase/couchbase-server.pid" -ns_server cookiefile "/opt/couchbase/var/lib/couchbase/couchbase-server.cookie-ns-server" -ns_server enable_mlockall true
218       3422  3393  0 08:01 ?        00:00:00 /opt/couchbase/lib/erlang/lib/os_mon-2.2.7/priv/bin/memsup
218       3423  3393  0 08:01 ?        00:00:00 /opt/couchbase/lib/erlang/lib/os_mon-2.2.7/priv/bin/cpu_sup
218       3425  3393  0 08:01 ?        00:00:00 /opt/couchbase/lib/erlang/lib/ssl-4.1.6/priv/bin/ssl_esock
218       3432  3393  0 08:01 ?        00:00:16 /opt/couchbase/lib/ns_server/erlang/lib/ns_server/priv/i386-linux-godu
218       3434  3359  0 08:01 ?        00:00:06 /opt/couchbase/bin/moxi -Z port_listen=11211,default_bucket_name=default,downstream_max=1024,downstream_conn_max=4,connect_max_errors=5,connect_retry_interval=30000,connect_timeout=400,auth_timeout=100,cycle=200,downstream_conn_queue_timeout=200,downstream_timeout=5000,wait_queue_timeout=200 -z url=http://127.0.0.1:8091/pools/default/saslBucketsStreaming -p 0 -Y y -O stderr
218       3435  3359 92 08:01 ?        06:54:56 /opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler.so -X /opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log -l 0.0.0.0:11210,0.0.0.0:11209:1000 -p 11210 -E /opt/couchbase/lib/memcached/bucket_engine.so -B binary -r -c 10000 -e admin=_admin;default_bucket_name=default;auto_create=false



Trond Norbye

Jan 3, 2014, 11:13:13 AM
to couc...@googlegroups.com
Are you doing constant writes against the cluster, and are you sure that none of the write operations succeed? The cluster will report a temporary failure if it is running out of memory. In this situation your cluster needs to write items to disk before the memory can be reused to store another item. Under "normal" load the cluster is able to do this without you noticing. (In this situation a cluster with more, smaller nodes would be "better" than a cluster with a few large nodes, since you would have more disks doing the persistence.)

If you look at the stats in the UI you should be able to see what's going on.
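
If it's easier than the UI, something along these lines with the Java client you already use should dump the relevant counters from every node (the node address is a placeholder, and exact stat names can vary between versions):

import com.couchbase.client.CouchbaseClient;
import java.net.SocketAddress;
import java.net.URI;
import java.util.Arrays;
import java.util.Map;

public class StatsDump {
    public static void main(String[] args) throws Exception {
        // Placeholder node address and bucket.
        CouchbaseClient client = new CouchbaseClient(
                Arrays.asList(URI.create("http://node1:8091/pools")), "default", "");
        // getStats() returns one map of counters per node in the cluster.
        Map<SocketAddress, Map<String, String>> stats = client.getStats();
        for (Map.Entry<SocketAddress, Map<String, String>> node : stats.entrySet()) {
            Map<String, String> s = node.getValue();
            System.out.println(node.getKey()
                    + "  mem_used=" + s.get("mem_used")
                    + "  ep_mem_high_wat=" + s.get("ep_mem_high_wat")
                    + "  ep_queue_size=" + s.get("ep_queue_size"));
        }
        client.shutdown();
    }
}

If mem_used is sitting above ep_mem_high_wat and not coming down, that matches the out-of-memory picture above.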

Cheers,

Trond






--
Trond Norbye

Thomas

Jan 3, 2014, 11:20:31 AM
to couc...@googlegroups.com
Hey Trond thanks for helping out,

The thing is that for about an hour or so now I have not performed any operation, but if I go and try a write it still fails, and the stats show zero disk operations. I will now try to create a new cluster with more, smaller nodes.

T.


Trond Norbye

Jan 3, 2014, 11:41:58 AM
to couc...@googlegroups.com
Perhaps you could post the statistics from the server somewhere? (Using the cbc tool from libcouchbase should do that easily.)

Cheers,

Trond






--
Trond Norbye

Thomas

Jan 3, 2014, 12:22:33 PM
to couc...@googlegroups.com
This constant issue led me to restart the cluster, but I was watching some of the statistics and noticed that ep_num_eject_failures was constantly increasing. I do not know if that helps.

I will try to get the cluster into the same state again and print out a complete stats output.


T.


Matt Ingenthron

Jan 3, 2014, 2:03:37 PM
to couc...@googlegroups.com
Hi Thomas,

Trond is on the right track here. This sounds like a memory accounting issue, and the stats would verify that. If you're getting TMPFAIL and it's trying to eject things and failing to do so, that could mean the system is full of metadata (though that should give an OOM) or we're seeing a bug.

Some stats would help us determine where to go from here, as Trond mentioned.

Thanks,

Matt

-- 
Matt Ingenthron
Couchbase, Inc.

Thomas

Jan 7, 2014, 4:01:00 AM
to couc...@googlegroups.com
Hi again,

Today I started the same set of tests against the Couchbase server; I have currently been adding data to the server at a small rate (200/sec), and my stats look like the below:

Every 2.0s: /opt/couchbase/bin/cbstats localhost:11210 -b default all | egrep 'item|mem|flusher|ep_queue|bg|eje|resi|warm'          Tue Jan  7 08:45:05 2014
 
 curr_items:                         40798244
 curr_items_tot:                     49348787
 curr_temp_items:                    0
 ep_access_scanner_num_items:        0
 ep_bg_fetch_delay:                  0
 ep_bg_fetched:                      0
 ep_bg_meta_fetched:                 0
 ep_bg_remaining_jobs:               0
 ep_chk_max_items:                   5000
 ep_diskqueue_items:                 1812316
 ep_diskqueue_memory:                57994112
 ep_failpartialwarmup:               0
 ep_flusher_state:                   running
 ep_flusher_todo:                    115917
 ep_item_begin_failed:               0
 ep_item_commit_failed:              0
 ep_item_flush_expired:              0
 ep_item_flush_failed:               0
 ep_item_num_based_new_chk:          1
 ep_items_rm_from_checkpoints:       66
 ep_max_item_size:                   20971520
 ep_mem_high_wat:                    12777527705
 ep_mem_low_wat:                     11274289152
 ep_mem_tracker_enabled:             true
 ep_meta_data_memory:                4920394995
 ep_mutation_mem_threshold:          95
 ep_num_eject_failures:              1111132682
 ep_num_non_resident:                47445563
 ep_num_value_ejects:                10029584
 ep_queue_size:                      1812316
 ep_tap_backfill_resident:           0.9
 ep_tap_bg_fetch_requeued:           0
 ep_tap_bg_fetched:                  6120426
 ep_tap_bg_max_pending:              500
 ep_total_del_items:                 0
 ep_total_new_items:                 6759834
 ep_uncommitted_items:               115917
 ep_waitforwarmup:                   0
 ep_warmup:                          1
 ep_warmup_batch_size:               1000
 ep_warmup_dups:                     0
 ep_warmup_min_items_threshold:      100
 ep_warmup_min_memory_threshold:     100
 ep_warmup_oom:                      0
 ep_warmup_thread:                   complete

Do you notice anything unusual? Some of the inserts are backing off with retries from my client implementation; some of my objects need to be retried about 20-30 times until they are accepted by the server. From the stats I see that the queue is a bit large at the rate I am inserting data. And I cannot understand ep_num_eject_failures, which is constantly increasing; is it some sort of counter for insert failures or something similar?
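
For context, my retry logic is essentially a capped exponential back-off around set(), roughly like the sketch below (the delays and the attempt limit here are illustrative, not the exact values I use):

import com.couchbase.client.CouchbaseClient;
import net.spy.memcached.internal.OperationFuture;

public class BackoffSet {
    // Retry a set() with a capped exponential back-off while the server keeps
    // answering with a temporary failure. Returns false if the document could
    // not be stored within the attempt limit.
    public static boolean setWithBackoff(CouchbaseClient client, String key, Object value)
            throws Exception {
        long delayMs = 50;
        for (int attempt = 1; attempt <= 30; attempt++) {
            OperationFuture<Boolean> f = client.set(key, 0, value);
            if (f.get()) {
                return true;
            }
            System.err.println("attempt " + attempt + " on " + key + ": " + f.getStatus().getMessage());
            Thread.sleep(delayMs);
            delayMs = Math.min(delayMs * 2, 5000); // cap the back-off at 5 seconds
        }
        return false;
    }
}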

Thanks

