Cluster freezes with "Temporary failure" appearing too often

Thomas

Jan 3, 2014, 7:52:50 AM
to couc...@googlegroups.com
Hi,

I have been running some tests against Couchbase Server 2.1.1 Community Edition and it seems that I have an issue with data insertion. I have been following the instructions from the following URL: http://docs.couchbase.com/couchbase-sdk-java-1.2/#advanced-usage

The problem is that, for very long periods (e.g. more than two hours), Couchbase seems to freeze and does not accept new documents via the Java client (set(...)), which makes data insertion problematic.
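
For reference, the insertion code boils down to a loop like the sketch below (node address, bucket and document contents are placeholders, not my real ones):

import com.couchbase.client.CouchbaseClient;
import net.spy.memcached.internal.OperationFuture;
import java.net.URI;
import java.util.Arrays;
import java.util.List;

public class InsertTest {
    public static void main(String[] args) throws Exception {
        // Placeholder node address and bucket; the real cluster has three nodes.
        List<URI> nodes = Arrays.asList(URI.create("http://node1:8091/pools"));
        CouchbaseClient client = new CouchbaseClient(nodes, "default", "");

        for (int i = 0; i < 1000000; i++) {
            OperationFuture<Boolean> f = client.set("doc:" + i, 0, "{\"value\":" + i + "}");
            if (!f.get()) {
                // During the freeze this prints "Temporary failure" over and over.
                System.err.println("set failed for doc:" + i + ": " + f.getStatus().getMessage());
            }
        }
        client.shutdown();
    }
}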

Any ideas of why this is happening?

My cluster has three nodes with 42 GB of memory in total, and I currently have about 100 million documents in one bucket (default).

Let me know if you need any additional information from my cluster.

Thanks

Thomas

Jan 3, 2014, 9:55:29 AM
to couc...@googlegroups.com
Hi,

I have tried to add some documents with the cbdocloader tool but again no luck; I get the following:

couchbase.couchbaseclient.MemcachedError: Memcached error #134:  Temporary failure

Any ideas?

Chad Kouse

Jan 3, 2014, 10:03:06 AM
to couc...@googlegroups.com
When this happens it would be interesting to check whether the memcached process has been restarted on any of the nodes. It sort of sounds like it is crashing, and with the size of your bucket you would incur a long warm-up time.

How many replicas are you running?
--chad



Thomas

Jan 3, 2014, 10:09:09 AM
to couc...@googlegroups.com
How can I check if the memcached process has been restarted? I run my nodes on RPM-based Linux machines.

I have one replica configured



Chad Kouse

Jan 3, 2014, 10:15:25 AM
to couc...@googlegroups.com
With 3 nodes and only 1 replica, there shouldn't be a situation where only 1 process dies and all sets fail. During this time, do some sets actually work? (Note that the same key will keep failing over and over, but another key may map to another vBucket which is still alive.)
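
One quick way to check, reusing the CouchbaseClient you already have open (the key names here are just placeholders), is something like this:

// Probe a handful of unrelated keys (this can sit inside the same method as
// the insert loop). Different keys map to different vBuckets, so if only one
// node's memcached is down, some of these sets should still succeed.
for (int i = 0; i < 20; i++) {
    String key = "probe:" + System.currentTimeMillis() + ":" + i;
    net.spy.memcached.ops.OperationStatus st = client.set(key, 60, "x").getStatus();
    System.out.println(key + " -> " + (st.isSuccess() ? "OK" : st.getMessage()));
}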

It's unlikely but not impossible that more than one node is crashing at the same time. 

To check if the memcached process restarted you can look in the process list and either compare the pid or do something like: http://linuxcommando.blogspot.com/2008/09/how-to-get-process-start-date-and-time.html?m=1

The process should have started at the same time as the other Couchbase services (such as beam.smp) - or just look at ps -ef | grep couchbase.


--chad



Thomas

Jan 3, 2014, 10:24:49 AM
to couc...@googlegroups.com
Thank you Chad, I will check it and let you know. Is there any log file to see why it is crashing? Can I disable the warmup period, or is it a standard part of the process?

Thanks again

Thomas



Chad Kouse

Jan 3, 2014, 10:29:17 AM
to couc...@googlegroups.com
There should be some logging - most likely it's not just crashing but being killed by something else, like the OOM killer. (This of course shouldn't be happening either, but I've seen reports of it happening.)

Can't disable the warmup period.

I don't have a lot of experience with Couchbase 2.x, but I think the only way to make it warm up faster is to use faster drives (SSDs) or more nodes (and thus less data to warm up on any given node).
--chad



Thomas

Jan 3, 2014, 10:42:43 AM
to couc...@googlegroups.com
I have checked all three nodes, but the processes seem to have been running since I initialized the cluster today.

Any other ideas? Is there a way to enable more detailed logging and see where exactly it is failing? Is there a known bug in 2.1.1 Community Edition?

Thanks



$> ps -eAf | grep couchbase
218       3339     1  0 08:00 ?        00:00:00 /opt/couchbase/lib/erlang/erts-5.8.5/bin/epmd -daemon
218       3359     1  0 08:01 ?        00:00:10 /opt/couchbase/lib/erlang/erts-5.8.5/bin/beam.smp -A 16 -- -root /opt/couchbase/lib/erlang -progname erl -- -home /opt/couchbase -- -smp enable -kernel inet_dist_listen_min 21100 inet_dist_listen_max 21299 error_logger false -sasl sasl_error_logger false -hidden -name babysitte...@127.0.0.1 -setcookie nocookie -noshell -noinput -noshell -noinput -run ns_babysitter_bootstrap -- -couch_ini /opt/couchbase/etc/couchdb/default.ini /opt/couchbase/etc/couchdb/default.d/capi.ini /opt/couchbase/etc/couchdb/default.d/geocouch.ini /opt/couchbase/etc/couchdb/local.ini -ns_babysitter cookiefile "/opt/couchbase/var/lib/couchbase/couchbase-server.cookie" -ns_server config_path "/opt/couchbase/etc/couchbase/static_config" -ns_server pidfile "/opt/couchbase/var/lib/couchbase/couchbase-server.pid" -ns_server cookiefile "/opt/couchbase/var/lib/couchbase/couchbase-server.cookie-ns-server" -ns_server enable_mlockall true
218       3393  3359  7 08:01 ?        00:34:34 /opt/couchbase/lib/erlang/erts-5.8.5/bin/beam.smp -A 16 -sbt u -P 327680 -K true -MMmcs 30 -- -root /opt/couchbase/lib/erlang -progname erl -- -home /opt/couchbase -- -smp enable -setcookie nocookie -kernel inet_dist_listen_min 21100 inet_dist_listen_max 21299 error_logger false -sasl sasl_error_logger false -nouser -run child_erlang child_start ns_bootstrap -- -smp enable -kernel inet_dist_listen_min 21100 inet_dist_listen_max 21299 error_logger false -sasl sasl_error_logger false -couch_ini /opt/couchbase/etc/couchdb/default.ini /opt/couchbase/etc/couchdb/default.d/capi.ini /opt/couchbase/etc/couchdb/default.d/geocouch.ini /opt/couchbase/etc/couchdb/local.ini -ns_babysitter cookiefile "/opt/couchbase/var/lib/couchbase/couchbase-server.cookie" -ns_server config_path "/opt/couchbase/etc/couchbase/static_config" -ns_server pidfile "/opt/couchbase/var/lib/couchbase/couchbase-server.pid" -ns_server cookiefile "/opt/couchbase/var/lib/couchbase/couchbase-server.cookie-ns-server" -ns_server enable_mlockall true
218       3422  3393  0 08:01 ?        00:00:00 /opt/couchbase/lib/erlang/lib/os_mon-2.2.7/priv/bin/memsup
218       3423  3393  0 08:01 ?        00:00:00 /opt/couchbase/lib/erlang/lib/os_mon-2.2.7/priv/bin/cpu_sup
218       3425  3393  0 08:01 ?        00:00:00 /opt/couchbase/lib/erlang/lib/ssl-4.1.6/priv/bin/ssl_esock
218       3432  3393  0 08:01 ?        00:00:16 /opt/couchbase/lib/ns_server/erlang/lib/ns_server/priv/i386-linux-godu
218       3434  3359  0 08:01 ?        00:00:06 /opt/couchbase/bin/moxi -Z port_listen=11211,default_bucket_name=default,downstream_max=1024,downstream_conn_max=4,connect_max_errors=5,connect_retry_interval=30000,connect_timeout=400,auth_timeout=100,cycle=200,downstream_conn_queue_timeout=200,downstream_timeout=5000,wait_queue_timeout=200 -z url=http://127.0.0.1:8091/pools/default/saslBucketsStreaming -p 0 -Y y -O stderr
218       3435  3359 92 08:01 ?        06:54:56 /opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler.so -X /opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log -l 0.0.0.0:11210,0.0.0.0:11209:1000 -p 11210 -E /opt/couchbase/lib/memcached/bucket_engine.so -B binary -r -c 10000 -e admin=_admin;default_bucket_name=default;auto_create=false



Trond Norbye

Jan 3, 2014, 11:13:13 AM
to couc...@googlegroups.com
Are you doing constant writes against the cluster, and are you sure that none of the write operations succeed? The cluster will report a temporary failure if it is running out of memory. In this situation your cluster needs to write items to disk before the memory can be reused to store another item. Under "normal" load the cluster is able to do this without you noticing. (In this situation a cluster with more, smaller nodes would be "better" than a cluster with a few large nodes, since you would have more disks doing the persistence.)

If you look at the stats in the UI you should be able to see what's going on.
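
If it's easier than the UI, something along these lines with the Java client you already use should dump the relevant counters from every node (the node address is a placeholder, and exact stat names can vary between versions):

import com.couchbase.client.CouchbaseClient;
import java.net.SocketAddress;
import java.net.URI;
import java.util.Arrays;
import java.util.Map;

public class StatsDump {
    public static void main(String[] args) throws Exception {
        // Placeholder node address and bucket.
        CouchbaseClient client = new CouchbaseClient(
                Arrays.asList(URI.create("http://node1:8091/pools")), "default", "");
        // getStats() returns one map of counters per node in the cluster.
        Map<SocketAddress, Map<String, String>> stats = client.getStats();
        for (Map.Entry<SocketAddress, Map<String, String>> node : stats.entrySet()) {
            Map<String, String> s = node.getValue();
            System.out.println(node.getKey()
                    + "  mem_used=" + s.get("mem_used")
                    + "  ep_mem_high_wat=" + s.get("ep_mem_high_wat")
                    + "  ep_queue_size=" + s.get("ep_queue_size"));
        }
        client.shutdown();
    }
}

If mem_used is sitting above ep_mem_high_wat and not coming down, that matches the out-of-memory picture above.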

Cheers,

Trond






--
Trond Norbye

Thomas

Jan 3, 2014, 11:20:31 AM
to couc...@googlegroups.com
Hey Trond thanks for helping out,

The thing is that for about an hour or so now I have not performed any operation, but if I go and try a write it still fails, and the stats show zero disk operations. I will now try to create a new cluster with more, smaller nodes.

T.


Trond Norbye

Jan 3, 2014, 11:41:58 AM
to couc...@googlegroups.com
Perhaps you could post the statistics from the server somewhere? (Using the cbc tool from libcouchbase should do that easily.)

Cheers,

Trond






--
Trond Norbye

Thomas

Jan 3, 2014, 12:22:33 PM
to couc...@googlegroups.com
This constant issue led me to restart the cluster, but I was watching some of the statistics and noticed that ep_num_eject_failures was constantly increasing. I do not know if that helps.

I will try to get the cluster into the same state again and print out a complete stats output.


T.


Matt Ingenthron

Jan 3, 2014, 2:03:37 PM
to couc...@googlegroups.com
Hi Thomas,

Trond is on the right track here. This sounds like a memory accounting issue, and the stats would verify that. If you're getting TMPFAIL and it's trying to eject things and failing to do so, that could mean the system is full of metadata (though that should give an OOM) or we're seeing a bug.

Some stats would help us determine where to go from here, as Trond mentioned.

Thanks,

Matt

-- 
Matt Ingenthron
Couchbase, Inc.

Thomas

Jan 7, 2014, 4:01:00 AM
to couc...@googlegroups.com
Hi again,

Today I started the same set of tests against the Couchbase server; I have currently been adding data to the server at a small rate (200/sec), and my stats look like the below:

Every 2.0s: /opt/couchbase/bin/cbstats localhost:11210 -b default all | egrep 'item|mem|flusher|ep_queue|bg|eje|resi|warm'          Tue Jan  7 08:45:05 2014
 
 curr_items:                         40798244
 curr_items_tot:                     49348787
 curr_temp_items:                    0
 ep_access_scanner_num_items:        0
 ep_bg_fetch_delay:                  0
 ep_bg_fetched:                      0
 ep_bg_meta_fetched:                 0
 ep_bg_remaining_jobs:               0
 ep_chk_max_items:                   5000
 ep_diskqueue_items:                 1812316
 ep_diskqueue_memory:                57994112
 ep_failpartialwarmup:               0
 ep_flusher_state:                   running
 ep_flusher_todo:                    115917
 ep_item_begin_failed:               0
 ep_item_commit_failed:              0
 ep_item_flush_expired:              0
 ep_item_flush_failed:               0
 ep_item_num_based_new_chk:          1
 ep_items_rm_from_checkpoints:       66
 ep_max_item_size:                   20971520
 ep_mem_high_wat:                    12777527705
 ep_mem_low_wat:                     11274289152
 ep_mem_tracker_enabled:             true
 ep_meta_data_memory:                4920394995
 ep_mutation_mem_threshold:          95
 ep_num_eject_failures:              1111132682
 ep_num_non_resident:                47445563
 ep_num_value_ejects:                10029584
 ep_queue_size:                      1812316
 ep_tap_backfill_resident:           0.9
 ep_tap_bg_fetch_requeued:           0
 ep_tap_bg_fetched:                  6120426
 ep_tap_bg_max_pending:              500
 ep_total_del_items:                 0
 ep_total_new_items:                 6759834
 ep_uncommitted_items:               115917
 ep_waitforwarmup:                   0
 ep_warmup:                          1
 ep_warmup_batch_size:               1000
 ep_warmup_dups:                     0
 ep_warmup_min_items_threshold:      100
 ep_warmup_min_memory_threshold:     100
 ep_warmup_oom:                      0
 ep_warmup_thread:                   complete

Do you notice anything unusual? Some of the inserts are backing off with retries from my client implementation; some of my objects need to be retried about 20-30 times until they are accepted by the server. From the stats I see that the queue is a bit large at the rate I am inserting data. And I cannot understand ep_num_eject_failures, which is constantly increasing; is it some sort of counter for insert failures or something similar?
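
For context, my retry logic is essentially a capped exponential back-off around set(), roughly like the sketch below (the delays and the attempt limit here are illustrative, not the exact values I use):

import com.couchbase.client.CouchbaseClient;
import net.spy.memcached.internal.OperationFuture;

public class BackoffSet {
    // Retry a set() with a capped exponential back-off while the server keeps
    // answering with a temporary failure. Returns false if the document could
    // not be stored within the attempt limit.
    public static boolean setWithBackoff(CouchbaseClient client, String key, Object value)
            throws Exception {
        long delayMs = 50;
        for (int attempt = 1; attempt <= 30; attempt++) {
            OperationFuture<Boolean> f = client.set(key, 0, value);
            if (f.get()) {
                return true;
            }
            System.err.println("attempt " + attempt + " on " + key + ": " + f.getStatus().getMessage());
            Thread.sleep(delayMs);
            delayMs = Math.min(delayMs * 2, 5000); // cap the back-off at 5 seconds
        }
        return false;
    }
}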

Thanks

