Couchbase 3.0 node goes down every other day

ashwini ahire

Nov 18, 2014, 1:42:56 AM
to couc...@googlegroups.com

A Couchbase 3.0 node is going down every weekend.
Version: 3.0.0 Enterprise Edition (build-1209)
Cluster State ID: 03B-020-217

Please see the logs below.
Could you please let me know how to avoid this failover?
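
For reference, here is a minimal sketch of how the auto-failover timeout could be checked and raised over the REST API, which is what I understand controls how quickly a slow node gets failed over (this assumes the standard /settings/autoFailover endpoint on port 8091; the host and credentials below are placeholders, not our real ones):

import base64
import urllib.parse
import urllib.request

HOST = "http://ec2-xxx.compute-1.amazonaws.com:8091"   # placeholder node address
AUTH = base64.b64encode(b"Administrator:password").decode()   # placeholder credentials

def call(path, form=None):
    # GET when no form data is given, POST (urlencoded) otherwise
    body = urllib.parse.urlencode(form).encode() if form else None
    req = urllib.request.Request(HOST + path, data=body)
    req.add_header("Authorization", "Basic " + AUTH)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

# Read current auto-failover settings (enabled flag, timeout in seconds, count)
print(call("/settings/autoFailover"))

# Raise the timeout so a briefly unresponsive memcached control connection
# is less likely to trigger an automatic failover
print(call("/settings/autoFailover", {"enabled": "true", "timeout": "120"}))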

Event     Module Code     Server Node     Time
Remote cluster reference "Virginia_to_OregonS" updated. New name is "VirginiaM_to_OregonS".     menelaus_web_remote_clusters000     ns_1@ec2-####104.compute-1.amazonaws.com     12:46:38 - Mon Nov 17, 2014
Client-side error-report for user undefined on node 'ns_1@ec2-####108 -.compute-1.amazonaws.com':
User-Agent:Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0
Got unhandled error:
Script error.
At:
http://ph.couchbase.net/v2?callback=jQuery162012552191850461902_1416204362614&launchID=8eba0b18a4e965daf1c3a0baecec994c-1416208180553-3638&version=3.0.0-1209-rel-enterprise&_=1416208180556:0:0
Backtrace:
<generated>
generateStacktrace@http://ec2-####108 -.compute-1.amazonaws.com:8091/js/bugsnag.js:411:7
bugsnag@http://ec2-####108 -.compute-1.amazonaws.com:8091/js/bugsnag.js:555:13

    menelaus_web102     ns_1@ec2-####108 -.compute-1.amazonaws.com     12:45:56 - Mon Nov 17, 2014
Replication from bucket "apro" to bucket "apro" on cluster "Virginia_to_OregonS" created.     menelaus_web_xdc_replications000     ns_1@ec2-####108 -.compute-1.amazonaws.com     12:38:49 - Mon Nov 17, 2014
Replication from bucket "apro" to bucket "apro" on cluster "Virginia_to_OregonS" removed.     xdc_rdoc_replication_srv000     ns_1@ec2-####108 -.compute-1.amazonaws.com     12:38:40 - Mon Nov 17, 2014
Rebalance completed successfully.
    ns_orchestrator001     ns_1@ec2-####107.compute-1.amazonaws.com     11:53:17 - Mon Nov 17, 2014
Bucket "ifa" rebalance does not seem to be swap rebalance     ns_vbucket_mover000     ns_1@ec2-####107.compute-1.amazonaws.com     11:53:04 - Mon Nov 17, 2014
Started rebalancing bucket ifa     ns_rebalancer000     ns_1@ec2-####107.compute-1.amazonaws.com     11:53:02 - Mon Nov 17, 2014
Could not automatically fail over node ('ns_1@ec2-####108 -.compute-1.amazonaws.com'). Rebalance is running.     auto_failover001     ns_1@ec2-####107.compute-1.amazonaws.com     11:49:58 - Mon Nov 17, 2014
Bucket "apro" rebalance does not seem to be swap rebalance     ns_vbucket_mover000     ns_1@ec2-####107.compute-1.amazonaws.com     11:48:02 - Mon Nov 17, 2014
Started rebalancing bucket apro     ns_rebalancer000     ns_1@ec2-####107.compute-1.amazonaws.com     11:47:59 - Mon Nov 17, 2014
Bucket "apro" loaded on node 'ns_1@ec2-####108 -.compute-1.amazonaws.com' in 366 seconds.     ns_memcached000     ns_1@ec2-####108 -.compute-1.amazonaws.com     11:47:58 - Mon Nov 17, 2014
Bucket "ifa" loaded on node 'ns_1@ec2-####108 -.compute-1.amazonaws.com' in 96 seconds.     ns_memcached000     ns_1@ec2-####108 -.compute-1.amazonaws.com     11:43:29 - Mon Nov 17, 2014
Starting rebalance, KeepNodes = ['ns_1@ec2-####104.compute-1.amazonaws.com',
'ns_1@ec2-####107.compute-1.amazonaws.com',
'ns_1@ec2-####108 -.compute-1.amazonaws.com'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@ec2-####108 -.compute-1.amazonaws.com'], Delta recovery buckets = all     ns_orchestrator004     ns_1@ec2-####107.compute-1.amazonaws.com     11:41:52 - Mon Nov 17, 2014
Control connection to memcached on 'ns_1@ec2-####108 -.compute-1.amazonaws.com' disconnected: {badmatch,{error,closed}}     ns_memcached000     ns_1@ec2-####108 -.compute-1.amazonaws.com     21:19:54 - Sun Nov 16, 2014
Node ('ns_1@ec2-####108 -.compute-1.amazonaws.com') was automatically failovered.
[stale,
{last_heard,{1416,152978,82869}},
{stale_slow_status,{1416,152863,60088}},
{now,{1416,152968,80503}},
{active_buckets,["apro","ifa"]},
{ready_buckets,["ifa"]},
{status_latency,5743},
{outgoing_replications_safeness_level,[{"apro",green},{"ifa",green}]},
{incoming_replications_conf_hashes,
[{"apro",
[{'ns_1@ec2-####104.compute-1.amazonaws.com',126796989},
{'ns_1@ec2-####107.compute-1.amazonaws.com',41498822}]},
{"ifa",
[{'ns_1@ec2-####104.compute-1.amazonaws.com',126796989},
{'ns_1@ec2-####107.compute-1.amazonaws.com',41498822}]}]},
{local_tasks,
[[{type,xdcr},
{id,<<"949dcce68db4b6d1add4c033ec4e32a9/apro/apro">>},
{errors,
[<<"2014-11-16 19:35:03 [Vb Rep] Error replicating vbucket 201. Please see logs for details.">>]},
{changes_left,220},
{docs_checked,51951817},
{docs_written,51951817},
{active_vbreps,4},
{max_vbreps,4},
{waiting_vbreps,210},
{time_working,1040792.401734},
{time_committing,0.0},
{time_working_rate,0.9101340661254117},
{num_checkpoints,53490},
{num_failedckpts,1},
{wakeups_rate,11.007892659036528},
{worker_batches_rate,20.514709046386255},
{rate_replication,22.015785318073057},
{bandwidth_usage,880.6314127229223},
{rate_doc_checks,22.015785318073057},
{rate_doc_opt_repd,22.015785318073057},
{meta_latency_aggr,0.0},
{meta_latency_wt,0.0},
{docs_latency_aggr,1271.0828664152195},
{docs_latency_wt,20.514709046386255}],
[{type,xdcr},
{id,<<"fc72b1b0e571e9c57671d6621cac6058/apro/apro">>},
{errors,[]},
{changes_left,278},
{docs_checked,51217335},
{docs_written,51217335},
{active_vbreps,4},
{max_vbreps,4},
{waiting_vbreps,269},
{time_working,1124595.930738},
{time_committing,0.0},
{time_working_rate,1.019751359238166},
{num_checkpoints,54571},
{num_failedckpts,3},
{wakeups_rate,6.50472893793788},
{worker_batches_rate,16.51200422707308},
{rate_replication,23.01673316501096},
{bandwidth_usage,936.6809670630547},
{rate_doc_checks,23.01673316501096},
{rate_doc_opt_repd,23.01673316501096},
{meta_latency_aggr,0.0},
{meta_latency_wt,0.0},
{docs_latency_aggr,1500.9621995190503},
{docs_latency_wt,16.51200422707308}],
[{type,xdcr},
{id,<<"16b1afb33dbcbde3d075e2ff634d9cc0/apro/apro">>},
{errors,
[<<"2014-11-16 19:21:55 [Vb Rep] Error replicating vbucket 258. Please see logs for details.">>,
<<"2014-11-16 19:22:41 [Vb Rep] Error replicating vbucket 219. Please see logs for details.">>,
<<"2014-11-16 19:23:04 [Vb Rep] Error replicating vbucket 315. Please see logs for details.">>,
<<"2014-11-16 20:06:40 [Vb Rep] Error replicating vbucket 643. Please see logs for details.">>,
<<"2014-11-16 20:38:20 [Vb Rep] Error replicating vbucket 651. Please see logs for details.">>]},
{changes_left,0},
{docs_checked,56060297},
{docs_written,56060297},
{active_vbreps,0},
{max_vbreps,4},
{waiting_vbreps,0},
{time_working,140073.119377},
{time_committing,0.0},
{time_working_rate,0.04649055712180432},
{num_checkpoints,103504},
{num_failedckpts,237},
{wakeups_rate,21.524796565643623},
{worker_batches_rate,22.52594989427821},
{rate_replication,22.52594989427821},
{bandwidth_usage,913.0518357147434},
{rate_doc_checks,22.52594989427821},
{rate_doc_opt_repd,22.52594989427821},
{meta_latency_aggr,0.0},
{meta_latency_wt,0.0},
{docs_latency_aggr,13.732319632216313},
{docs_latency_wt,22.52594989427821}],
[{type,xdcr},
{id,<<"b734095ad63ea9832f9da1b1ef3449ac/apro/apro">>},
{errors,
[<<"2014-11-16 19:36:22 [Vb Rep] Error replicating vbucket 260. Please see logs for details.">>,
<<"2014-11-16 19:36:38 [Vb Rep] Error replicating vbucket 299. Please see logs for details.">>,
<<"2014-11-16 19:36:43 [Vb Rep] Error replicating vbucket 205. Please see logs for details.">>,
<<"2014-11-16 19:36:48 [Vb Rep] Error replicating vbucket 227. Please see logs for details.">>,
<<"2014-11-16 20:26:19 [Vb Rep] Error replicating vbucket 175. Please see logs for details.">>,
<<"2014-11-16 20:26:25 [Vb Rep] Error replicating vbucket 221. Please see logs for details.">>,
<<"2014-11-16 21:16:40 [Vb Rep] Error replicating vbucket 293. Please see logs for details.">>,
<<"2014-11-16 21:16:40 [Vb Rep] Error replicating vbucket 251. Please see logs for details.">>,
<<"2014-11-16 21:17:06 [Vb Rep] Error replicating vbucket 270. Please see logs for details.">>]},
{changes_left,270},
{docs_checked,50418639},
{docs_written,50418639},
{active_vbreps,4},
{max_vbreps,4},
{waiting_vbreps,261},
{time_working,1860159.788732},
{time_committing,0.0},
{time_working_rate,1.008940755729142},
{num_checkpoints,103426},
{num_failedckpts,87},
{wakeups_rate,6.50782891818858},
{worker_batches_rate,16.01927118323343},
{rate_replication,23.027702325898055},
{bandwidth_usage,933.1225464233472},
{rate_doc_checks,23.027702325898055},
{rate_doc_opt_repd,23.027702325898055},
{meta_latency_aggr,0.0},
{meta_latency_wt,0.0},
{docs_latency_aggr,1367.9901922012182},
{docs_latency_wt,16.01927118323343}],
[{type,xdcr},
{id,<<"e213600feb7ec1dfa0537173ad7f2e02/apro/apro">>},
{errors,
[<<"2014-11-16 20:16:39 [Vb Rep] Error replicating vbucket 647. Please see logs for details.">>,
<<"2014-11-16 20:17:31 [Vb Rep] Error replicating vbucket 619. Please see logs for details.">>]},
{changes_left,854},
{docs_checked,33371659},
{docs_written,33371659},
{active_vbreps,4},
{max_vbreps,4},
{waiting_vbreps,318},
{time_working,2421539.8537169998},
{time_committing,0.0},
{time_working_rate,1.7382361098734072},
{num_checkpoints,102421},
{num_failedckpts,85},
{wakeups_rate,3.0038659755104824},
{worker_batches_rate,7.009020609524459},
{rate_replication,30.539304084356573},
{bandwidth_usage,1261.6237097144026},
{rate_doc_checks,30.539304084356573},
{rate_doc_opt_repd,30.539304084356573},
{meta_latency_aggr,0.0},
{meta_latency_wt,0.0},
{docs_latency_aggr,1997.2249284829577},
{docs_latency_wt,7.009020609524459}]]},
{memory,
[{total,752400928},
{processes,375623512},
{processes_used,371957960},
{system,376777416},
{atom,594537},
{atom_used,591741},
{binary,94783616},
{code,15355960},
{ets,175831736}]},
{system_memory_data,
[{system_total_memory,64552329216},
{free_swap,0},
{total_swap,0},
{cached_memory,27011342336},
{buffered_memory,4885585920},
{free_memory,12694065152},
{total_memory,64552329216}]},
{node_storage_conf,
[{db_path,"/data/couchbase"},{index_path,"/data/couchbase"}]},
{statistics,
[{wall_clock,{552959103,4997}},
{context_switches,{8592101014,0}},
{garbage_collection,{2034857586,5985868018204,0}},
{io,{{input,270347194989},{output,799175854069}}},
{reductions,{833510054494,7038093}},
{run_queue,0},
{runtime,{553128340,5090}},
{run_queues,{0,0,0,0,0,0,0,0}}]},
{system_stats,
[{cpu_utilization_rate,2.5316455696202533},
{swap_total,0},
{swap_used,0},
{mem_total,64552329216},
{mem_free,44590993408}]},
{interesting_stats,
[{cmd_get,0.0},
{couch_docs_actual_disk_size,21729991305},
{couch_docs_data_size,11673379153},
{couch_views_actual_disk_size,0},
{couch_views_data_size,0},
{curr_items,30268090},
{curr_items_tot,60625521},
{ep_bg_fetched,0.0},
{get_hits,0.0},
{mem_used,11032659776},
{ops,116.0},
{vb_replica_curr_items,30357431}]},
{per_bucket_interesting_stats,
[{"ifa",
[{cmd_get,0.0},
{couch_docs_actual_disk_size,611617800},
{couch_docs_data_size,349385716},
{couch_views_actual_disk_size,0},
{couch_views_data_size,0},
{curr_items,1020349},
{curr_items_tot,2039753},
{ep_bg_fetched,0.0},
{get_hits,0.0},
{mem_used,307268040},
{ops,0.0},
{vb_replica_curr_items,1019404}]},
{"apro",
[{cmd_get,0.0},
{couch_docs_actual_disk_size,21118373505},
{couch_docs_data_size,11323993437},
{couch_views_actual_disk_size,0},
{couch_views_data_size,0},
{curr_items,29247741},
{curr_items_tot,58585768},
{ep_bg_fetched,0.0},
{get_hits,0.0},
{mem_used,10725391736},
{ops,116.0},
{vb_replica_curr_items,29338027}]}]},
{processes_stats,
[{<<"proc/(main)beam.smp/cpu_utilization">>,0},
{<<"proc/(main)beam.smp/major_faults">>,0},
{<<"proc/(main)beam.smp/major_faults_raw">>,0},
{<<"proc/(main)beam.smp/mem_resident">>,943411200},
{<<"proc/(main)beam.smp/mem_share">>,6901760},
{<<"proc/(main)beam.smp/mem_size">>,2951794688},
{<<"proc/(main)beam.smp/minor_faults">>,0},
{<<"proc/(main)beam.smp/minor_faults_raw">>,456714435},
{<<"proc/(main)beam.smp/page_faults">>,0},
{<<"proc/(main)beam.smp/page_faults_raw">>,456714435},
{<<"proc/beam.smp/cpu_utilization">>,0},
{<<"proc/beam.smp/major_faults">>,0},
{<<"proc/beam.smp/major_faults_raw">>,0},
{<<"proc/beam.smp/mem_resident">>,108077056},
{<<"proc/beam.smp/mem_share">>,2973696},
{<<"proc/beam.smp/mem_size">>,1113272320},
{<<"proc/beam.smp/minor_faults">>,0},
{<<"proc/beam.smp/minor_faults_raw">>,6583},
{<<"proc/beam.smp/page_faults">>,0},
{<<"proc/beam.smp/page_faults_raw">>,6583},
{<<"proc/memcached/cpu_utilization">>,0},
{<<"proc/memcached/major_faults">>,0},
{<<"proc/memcached/major_faults_raw">>,0},
{<<"proc/memcached/mem_resident">>,17016668160},
{<<"proc/memcached/mem_share">>,6885376},
{<<"proc/memcached/mem_size">>,17812746240},
{<<"proc/memcached/minor_faults">>,0},
{<<"proc/memcached/minor_faults_raw">>,4385001},
{<<"proc/memcached/page_faults">>,0},
{<<"proc/memcached/page_faults_raw">>,4385001}]},
{cluster_compatibility_version,196608},
{version,
[{lhttpc,"1.3.0"},
{os_mon,"2.2.14"},
{public_key,"0.21"},
{asn1,"2.0.4"},
{couch,"2.1.1r-432-gc2af28d"},
{kernel,"2.16.4"},
{syntax_tools,"1.6.13"},
{xmerl,"1.3.6"},
{ale,"3.0.0-1209-rel-enterprise"},
{couch_set_view,"2.1.1r-432-gc2af28d"},
{compiler,"4.9.4"},
{inets,"5.9.8"},
{mapreduce,"1.0.0"},
{couch_index_merger,"2.1.1r-432-gc2af28d"},
{ns_server,"3.0.0-1209-rel-enterprise"},
{oauth,"7d85d3ef"},
{crypto,"3.2"},
{ssl,"5.3.3"},
{sasl,"2.3.4"},
{couch_view_parser,"1.0.0"},
{mochiweb,"2.4.2"},
{stdlib,"1.19.4"}]},
{supported_compat_version,[3,0]},
{advertised_version,[3,0,0]},
{system_arch,"x86_64-unknown-linux-gnu"},
{wall_clock,552959},
{memory_data,{64552329216,51966836736,{<13661.389.0>,147853368}}},
{disk_data,
[{"/",10309828,38},
{"/dev/shm",31519692,0},
{"/mnt",154817516,1},
{"/data",1056894132,3}]},
{meminfo,
<<"MemTotal: 63039384 kB\nMemFree: 12396548 kB\nBuffers: 4771080 kB\nCached: 26378264 kB\nSwapCached: 0 kB\nActive: 31481704 kB\nInactive: 17446048 kB\nActive(anon): 17750620 kB\nInactive(anon): 2732 kB\nActive(file): 13731084 kB\nInactive(file): 17443316 kB\nUnevictable: 0 kB\nMlocked: 0 kB\nSwapTotal: 0 kB\nSwapFree: 0 kB\nDirty: 13312 kB\nWriteback: 0 kB\nAnonPages: 17753376 kB\nMapped: 14516 kB\nShmem: 148 kB\nSlab: 1297976 kB\nSReclaimable: 1219296 kB\nSUnreclaim: 78680 kB\nKernelStack: 2464 kB\nPageTables: 39308 kB\nNFS_Unstable: 0 kB\nBounce: 0 kB\nWritebackTmp: 0 kB\nCommitLimit: 31519692 kB\nCommitted_AS: 19222984 kB\nVmallocTotal: 34359738367 kB\nVmallocUsed: 114220 kB\nVmallocChunk: 34359618888 kB\nHardwareCorrupted: 0 kB\nAnonHugePages: 17432576 kB\nHugePages_Total: 0\nHugePages_Free: 0\nHugePages_Rsvd: 0\nHugePages_Surp: 0\nHugepagesize: 2048 kB\nDirectMap4k: 6144 kB\nDirectMap2M: 63993856 kB\n">>}]     auto_failover001     ns_1@ec2-####107.compute-1.amazonaws.com     21:19:53 - Sun Nov 16, 2014
Failed over 'ns_1@ec2-####108 -.compute-1.amazonaws.com': ok     ns_rebalancer000     ns_1@ec2-####107.compute-1.amazonaws.com     21:19:53 - Sun Nov 16, 2014
Skipped vbucket activations and replication topology changes because not all remaining node were found to have healthy bucket "ifa": ['ns_1@ec2-####107.compute-1.amazonaws.com']     ns_rebalancer000     ns_1@ec2-####107.compute-1.amazonaws.com     21:19:53 - Sun Nov 16, 2014
Shutting down bucket "ifa" on 'ns_1@ec2-####108 -.compute-1.amazonaws.com' for deletion     ns_memcached000     ns_1@ec2-####108 -.compute-1.amazonaws.com     21:19:49 - Sun Nov 16, 2014
Starting failing over 'ns_1@ec2-####108 -.compute-1.amazonaws.com'     ns_rebalancer000     ns_1@ec2-####107.compute-1.amazonaws.com     21:19:48 - Sun Nov 16, 2014
Bucket "apro" loaded on node 'ns_1@ec2-####108 -.compute-1.amazonaws.com' in 0 seconds.     ns_memcached000     ns_1@ec2-####108 -.compute-1.amazonaws.com     21:19:44 - Sun Nov 16, 2014
Control connection to memcached on 'ns_1@ec2-####108 -.compute-1.amazonaws.com' disconnected: {{badmatch,{error,timeout}},
 [{mc_client_binary,cmd_vocal_recv,5,[{file,"src/mc_client_binary.erl"},{line,151}]},
  {mc_client_binary,select_bucket,2,[{file,"src/mc_client_binary.erl"},{line,346}]},
  {ns_memcached,ensure_bucket,2,[{file,"src/ns_memcached.erl"},{line,1269}]},
  {ns_memcached,handle_info,2,[{file,"src/ns_memcached.erl"},{line,744}]},
  {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,604}]},
  {ns_memcached,init,1,[{file,"src/ns_memcached.erl"},{line,171}]},
  {gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},
  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}     ns_memcached000     ns_1@ec2-####108 -.compute-1.amazonaws.com     21:19:44 - Sun Nov 16, 2014
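
For anyone correlating these events with node health, a minimal sketch of listing per-node status from the cluster REST API (this assumes the standard /pools/default endpoint on port 8091; the host and credentials below are placeholders):

import base64
import json
import urllib.request

HOST = "http://ec2-xxx.compute-1.amazonaws.com:8091"   # placeholder node address
AUTH = base64.b64encode(b"Administrator:password").decode()   # placeholder credentials

req = urllib.request.Request(HOST + "/pools/default")
req.add_header("Authorization", "Basic " + AUTH)
with urllib.request.urlopen(req) as resp:
    pool = json.load(resp)

for node in pool["nodes"]:
    # status is "healthy"/"unhealthy"; clusterMembership shows active vs. failed-over nodes
    print(node["hostname"], node["status"], node["clusterMembership"])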