Membase 1.7 data loss


kay kay

Oct 3, 2012, 3:25:03 AM
to couc...@googlegroups.com
I have faced a strange problem with data loss.

I use memcached buckets and two separate membase 1.7 clusters with three servers. Sometimes the data I need goes missing. I have added monitoring and analyzed recent logs to figure out what is wrong.

There are two types of data loss:

- data lost and never recovered (rarely), on the first cluster
This happens rarely on the first cluster, when one of the membase nodes drops out. The logs show that the node went down and came back up after a few seconds, and the needed data was gone afterwards. For some reason this problem occurs only on the third membase node.

- data lost for a moment and automatically recovered within a few seconds (frequently), on the second cluster
This happens frequently on the second membase cluster. There are no "node went down" log entries, but the data is missing for only a few seconds.

Monitoring works this way: I added 1000 keys with test data and check them every minute.
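Roughly, the check does something like this (a simplified sketch, assuming python-memcached talking to the local moxi on port 11220; the key names here are made up):

import time
import memcache  # python-memcached

# Hypothetical test keys; the real monitoring uses its own key names.
KEYS = ["monitor:%d" % i for i in range(1000)]

mc = memcache.Client(["127.0.0.1:11220"])

# Seed the test data once.
for key in KEYS:
    mc.set(key, "test-value")

# Every minute, count how many of the 1000 keys are still readable.
while True:
    missing = sum(1 for key in KEYS if mc.get(key) is None)
    if missing:
        print("%s: %d of %d keys missing" % (time.ctime(), missing, len(KEYS)))
    time.sleep(60)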

I also have connectivity monitoring between all the cluster nodes; the graphs are attached. Could anyone help me solve these problems?
first.png
second.png

Aliaksey Kandratsenka

Oct 3, 2012, 7:38:49 PM
to couc...@googlegroups.com
Given it's a memcached bucket, some more details would help. In particular, what is your client setup?

kay kay

Oct 4, 2012, 1:06:14 AM
to couc...@googlegroups.com
I use moxi:

port_listen=11220,
default_bucket_name=default,
downstream_max=0,
downstream_conn_max=0,
downstream_conn_queue_timeout=200,
downstream_timeout=400,
wait_queue_timeout=200,
connect_max_errors=5,
connect_retry_interval=30000,
connect_timeout=200,
cycle=200

moxi-cluster.cfg:


On Thursday, October 4, 2012, 3:38:51 UTC+4, Aliaksey Kandratsenka wrote:

Frank Weigel

Oct 4, 2012, 10:45:51 PM
to couc...@googlegroups.com
Memcached buckets are pure caching buckets and don't provide replication or persistence. So if a node goes down in the sense of crashing or rebooting, it is expected to come back without any data. If one node seems to keep going down, did you check for crashes of the memcached process (or even node reboots)?

Furthermore, since it is a cache, data may be evicted at any point as new data is added.
So if you keep doing writes, some of the data is expected to be "lost". Using Couchbase buckets would give you persistence and replication, so data is available across node failures and no data would ever be evicted (it may be evicted from the cache, but would be retrieved from disk on a cache miss). Couchbase buckets are still accessible with a pure memcached API (thanks to moxi), so you would not need to change your application to switch to Couchbase buckets.
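For example, an application that already talks plain memcached through moxi stays exactly the same when the bucket behind moxi becomes a Couchbase bucket (a minimal sketch, assuming python-memcached; the port is just an example, use your own port_listen):

import memcache  # python-memcached

# The client only sees the moxi port; whether the bucket behind it is a
# memcached bucket or a Couchbase bucket makes no difference to this code.
mc = memcache.Client(["127.0.0.1:11211"])
mc.set("some-key", "some-value")
print(mc.get("some-key"))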

Apologies if the caching nature and the resulting behaviour were already clear and you had excluded those factors; I just want to make sure the expected behaviour of memcached buckets is clear.

Cheers,

Frank


kay kay

Oct 5, 2012, 2:28:11 AM
to couc...@googlegroups.com
Actually, the problem node doesn't crash. The other nodes just see it as "went down", and the "went down" node in turn sees the other nodes as "went down". Here is an example from the cluster log:

2012-09-26 10:37:14.775 - warning - Node 'ns...@192.168.3.7' saw that node 'ns...@192.168.3.8' went down.
2012-09-26 10:37:15.970 - warning - Node 'ns...@192.168.3.6' saw that node 'ns...@192.168.3.8' went down.
2012-09-26 10:38:20.640 - info - Node 'ns...@192.168.3.7' saw that node 'ns...@192.168.3.8' came up.
2012-09-26 10:38:20.653 - info - Node 'ns...@192.168.3.6' saw that node 'ns...@192.168.3.8' came up.
2012-09-26 10:38:20.710 - warning - Node 'ns...@192.168.3.8' saw that node 'ns...@192.168.3.7' went down.
2012-09-26 10:38:20.710 - warning - Node 'ns...@192.168.3.8' saw that node 'ns...@192.168.3.6' went down.
2012-09-26 10:38:20.710 - info - Node 'ns...@192.168.3.8' saw that node 'ns...@192.168.3.7' came up.
2012-09-26 10:38:20.710 - info - Node 'ns...@192.168.3.8' saw that node 'ns...@192.168.3.6' came up.

The network monitoring looks fine, so I don't know where the problem lies.

On Friday, October 5, 2012, 6:45:59 UTC+4, Frank wrote:

Frank Weigel

Oct 5, 2012, 2:42:25 AM
to couc...@googlegroups.com

Does the log on that node show anything (e.g. that the server restarted)?

F

kay kay

Oct 5, 2012, 3:30:39 AM
to couc...@googlegroups.com
Here you are, the latest ones:

INFO REPORT <5881.32276.3759> 2012-10-01 07:15:21
===============================================================================
ns_log: logging ns_node_disco:5:Node 'ns...@192.168.3.8' saw that node 'ns...@192.168.3.6' went down.
ERROR REPORT <5881.401.0> 2012-10-01 07:15:21
===============================================================================
ns...@192.168.3.8:<5881.401.0>:system_stats_collector:130: lost 9 ticks
ERROR REPORT <5881.3651.1906> 2012-10-01 07:15:21
===============================================================================
ns...@192.168.3.8:<5881.3651.1906>:stats_collector:126: Dropped 10 ticks
ERROR REPORT <5881.19937.3636> 2012-10-01 07:15:21
===============================================================================
ns...@192.168.3.8:<5881.19937.3636>:stats_collector:126: Dropped 10 ticks
INFO REPORT <5881.32276.3759> 2012-10-01 07:15:21
===============================================================================
ns_log: logging ns_node_disco:5:Node 'ns...@192.168.3.8' saw that node 'ns...@192.168.3.7' went down.
INFO REPORT <5881.338.0> 2012-10-01 07:15:21
===============================================================================
ns_node_disco_log: nodes changed: ['ns...@192.168.3.7',
'ns...@192.168.3.8']
ERROR REPORT <5881.1450.1906> 2012-10-01 07:15:21
===============================================================================
ns...@192.168.3.8:<5881.1450.1906>:stats_collector:126: Dropped 10 ticks
ERROR REPORT <5881.3297.2484> 2012-10-01 07:15:21
===============================================================================
ns...@192.168.3.8:<5881.3297.2484>:stats_collector:126: Dropped 10 ticks
ERROR REPORT <5881.5199.1906> 2012-10-01 07:15:21
===============================================================================
ns...@192.168.3.8:<5881.5199.1906>:stats_collector:126: Dropped 10 ticks
INFO REPORT <5881.32276.3759> 2012-10-01 07:15:21
===============================================================================
ns_log: logging ns_node_disco:4:Node 'ns...@192.168.3.8' saw that node 'ns...@192.168.3.6' came up.

On Friday, October 5, 2012, 10:42:37 UTC+4, Frank wrote:

kay kay

Oct 5, 2012, 4:28:15 AM
to couc...@googlegroups.com
I have found a very strange log entry. I don't know if it is related to my problem.

In the past I had old membase servers in the cluster, which were later reinstalled as KVM host nodes with a new OS. For some reason I found these old nodes in the logs among the current nodes. I never restarted the membase cluster; I only removed the old nodes from the configuration via the web control panel. How could the old membase nodes still show up in the current logs?

I can restart the cluster now to see whether the bug goes away, but I guess the developers may need some information on this bug, so I'll wait for a while.

On Friday, October 5, 2012, 11:30:39 UTC+4, kay kay wrote:

kay kay

Oct 10, 2012, 10:24:50 AM
to couc...@googlegroups.com
I still have problems with the membase cluster.

I wrote a script which executes get requests for 1000 items.

Sometimes it gets only 991, 999, or 936 items. After a few seconds the item count goes back to 1000. The membase logs don't show any errors.

Can anybody help me?

Chad Kouse

Oct 10, 2012, 11:21:40 AM
to couc...@googlegroups.com
How many nodes in your cluster?  Is this a membase bucket type? Are you connecting via moxi?  The counts you're seeing, are they in the web console? Are you doing 1000 gets or 1 multiget?  What client are you using?
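(To illustrate the last question, the two access patterns would look roughly like this with python-memcached; just a sketch with made-up key names, not your actual script:)

import memcache  # python-memcached

mc = memcache.Client(["127.0.0.1:11220"])  # assuming the moxi port from above
keys = ["item:%d" % i for i in range(1000)]  # hypothetical keys

# (a) 1000 individual gets: one round trip through moxi per key.
one_by_one = [mc.get(key) for key in keys]

# (b) one multiget: a single bulk request; a single downstream timeout
# in moxi can make the whole batch come back short.
bulk = mc.get_multi(keys)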

--
Chad Kouse

kay kay

Oct 11, 2012, 1:04:17 AM
to couc...@googlegroups.com
I have three nodes.
It is a memcached bucket (it is faster than a membase bucket).
I am connecting via moxi.
I have attached the Python script to answer the rest of the questions.

Membase server - 1.7.2
Moxi server - 1.7.1

I appreciate your help.

On Wednesday, October 10, 2012, 19:21:45 UTC+4, chadkouse wrote:
membase_check_data.py

kay kay

Oct 11, 2012, 6:44:59 AM
to couc...@googlegroups.com
Detailed logs showed me that the errors line up with "tot_downstream_timeout" in the moxi stats. I also noticed packet drops on the ethernet interface.
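In case it helps anyone else, I watch that counter by polling moxi's stats over its memcached port (a rough sketch; I believe the subcommand is "stats proxy", but the exact stat names may vary between moxi versions):

import socket
import time

def moxi_stats(host="127.0.0.1", port=11220, cmd="stats proxy"):
    # Send a raw memcached-protocol stats command to moxi and read until END.
    s = socket.create_connection((host, port), timeout=2)
    s.sendall(cmd.encode() + b"\r\n")
    data = b""
    while not data.endswith(b"END\r\n"):
        chunk = s.recv(4096)
        if not chunk:
            break
        data += chunk
    s.close()
    return data.decode()

# Print the timeout-related counters once a minute.
while True:
    for line in moxi_stats().splitlines():
        if "downstream_timeout" in line:
            print("%s %s" % (time.ctime(), line))
    time.sleep(60)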

Now I'll try to figure out what is wrong.

Thanks for the help.

Frank Weigel

Oct 11, 2012, 8:09:48 PM
to couc...@googlegroups.com
Could you upgrade to 1.8.1 and moxi 1.8? I believe a bunch of moxi errors were fixed.

From: kay kay <kay....@gmail.com>
Reply-To: "couc...@googlegroups.com" <couc...@googlegroups.com>
Date: Thursday, October 11, 2012 3:44 AM
To: "couc...@googlegroups.com" <couc...@googlegroups.com>
Subject: Re: Membase 1.7 data loss
