Membase 1.7 data loss

Membase 1.7 data loss kay ru 10/3/12 12:25 AM
I have faced a strange problem with data loss.

I use memcached buckets on two separate membase 1.7 clusters with three servers. Sometimes the data I need goes missing. I have added monitoring and analyzed recent logs to figure out what is wrong.

There are two types of data loss:

- data lost and never recovered (rare) on the first cluster
This happens occasionally, when one of the membase nodes on the first cluster is lost. The logs show that the node went down and came back up after a few seconds, and the needed data was gone. For some reason this happens only on the third membase node.

- data lost for a moment and automatically recovered within a few seconds (frequent) on the second cluster
This happens frequently on the second membase cluster. There are no "node went down" log entries, but the data is missing for only a few seconds.

Monitoring works this way: I added 1000 keys with test data and I check them every minute.
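
Roughly, the check looks like this (this is not the exact script; the moxi host/port, key names and payload below are only placeholders, and I use the python-memcached client):

# Rough sketch of the monitoring check described above, not the real script.
# The moxi address and the key naming scheme are placeholders.
import memcache

MOXI = ["127.0.0.1:11220"]
KEYS = ["monitor:%d" % i for i in range(1000)]

mc = memcache.Client(MOXI)

def seed():
    # write the 1000 test keys once
    for key in KEYS:
        mc.set(key, "test-data")

def check():
    # return how many of the 1000 test keys are readable right now
    found = mc.get_multi(KEYS)
    return len(found)

if __name__ == "__main__":
    seed()
    print("present: %d" % check())  # run check() once a minute, e.g. from cron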

I also have connectivity monitoring between all cluster nodes; I have attached the graphs. Could anyone help me solve these problems?
Re: Membase 1.7 data loss Aliaksey Kandratsenka 10/3/12 4:38 PM
Given it's a memcached bucket, some more details would help. In particular,
what is your client setup?
Re: Membase 1.7 data loss kay ru 10/3/12 10:06 PM
I use moxi:

port_listen=11220,
default_bucket_name=default,
downstream_max=0,
downstream_conn_max=0,
downstream_conn_queue_timeout=200,
downstream_timeout=400,
wait_queue_timeout=200,
connect_max_errors=5,
connect_retry_interval=30000,
connect_timeout=200,
cycle=200
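
One note on these values: as far as I understand, downstream_timeout=400 means moxi waits at most 400 ms for a membase node to answer, and if the node is slower the client simply gets a miss even though the key may still be cached, so a short slowdown can look like data loss. A rough way to tell such a transient miss from a real eviction is to retry the miss once after a short delay (sketch only; host, port and delay are placeholders):

# Sketch: retry a miss once to distinguish a transient (timeout-induced)
# miss from a key that is really gone. All values here are placeholders.
import time
import memcache

mc = memcache.Client(["127.0.0.1:11220"])

def get_with_retry(key, delay=0.5):
    value = mc.get(key)
    if value is None:
        time.sleep(delay)    # give moxi / the downstream node a moment
        value = mc.get(key)  # a real eviction stays None on the retry
    return value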

moxi-cluster.cfg:


On Thursday, October 4, 2012 at 3:38:51 AM UTC+4, Aliaksey Kandratsenka wrote:
Re: Membase 1.7 data loss Frank 10/4/12 7:45 PM
Memcached buckets are pure caching buckets and don't provide replication or persistence. So if a node goes down in the sense of crashing or rebooting, it is expected to come back without any data. If one node seems to keep going down, did you check for crashes of the memcached process (or even node reboots)?

Furthermore, as a cache, data may be evicted at any point as new data is added.
So if you keep doing writes, some of the data is expected to be "lost". Using Couchbase buckets would give you persistence and replication, so data is available across node failures and no data would ever be evicted (it may be evicted from cache, but would be retrieved from disk on a cache miss). Couchbase buckets are still accessible with a pure memcached API (thanks to moxi), so you would not need to change your application to switch to Couchbase buckets.

Apologies if the caching nature and resulting behaviour was clear and you already excluded those factors; just want to make sure that the expected behaviour of memcached buckets was clear.

Cheers,

Frank



Re: Membase 1.7 data loss kay ru 10/4/12 11:28 PM
Actually the problem node doesn't crash. To the other nodes it just shows up as "went down", and from that node's point of view the other nodes appear to have "went down" as well. Here is an example from the cluster log:

2012-09-26 10:37:14.775 - warning - Node 'ns...@192.168.3.7' saw that node 'ns...@192.168.3.8' went down.
2012-09-26 10:37:15.970 - warning - Node 'ns...@192.168.3.6' saw that node 'ns...@192.168.3.8' went down.
2012-09-26 10:38:20.640 - info - Node 'ns...@192.168.3.7' saw that node 'ns...@192.168.3.8' came up.
2012-09-26 10:38:20.653 - info - Node 'ns...@192.168.3.6' saw that node 'ns...@192.168.3.8' came up.
2012-09-26 10:38:20.710 - warning - Node 'ns...@192.168.3.8' saw that node 'ns...@192.168.3.7' went down.
2012-09-26 10:38:20.710 - warning - Node 'ns...@192.168.3.8' saw that node 'ns...@192.168.3.6' went down.
2012-09-26 10:38:20.710 - info - Node 'ns...@192.168.3.8' saw that node 'ns...@192.168.3.7' came up.
2012-09-26 10:38:20.710 - info - Node 'ns...@192.168.3.8' saw that node 'ns...@192.168.3.6' came up.

The network monitoring looks fine, so I don't know where the problem lies.

On Friday, October 5, 2012 at 6:45:59 AM UTC+4, Frank wrote:
Re: Membase 1.7 data loss Frank 10/4/12 11:42 PM

Does the log on that node show anything (e.g. that the server restarted)?

F

Re: Membase 1.7 data loss kay ru 10/5/12 12:30 AM
Here you are, the latest ones:

INFO REPORT <5881.32276.3759> 2012-10-01 07:15:21
===============================================================================
ns_log: logging ns_node_disco:5:Node 'ns...@192.168.3.8' saw that node 'ns...@192.168.3.6' went down.
ERROR REPORT <5881.401.0> 2012-10-01 07:15:21
===============================================================================
ns...@192.168.3.8:<5881.401.0>:system_stats_collector:130: lost 9 ticks
ERROR REPORT <5881.3651.1906> 2012-10-01 07:15:21
===============================================================================
ns...@192.168.3.8:<5881.3651.1906>:stats_collector:126: Dropped 10 ticks
ERROR REPORT <5881.19937.3636> 2012-10-01 07:15:21
===============================================================================
ns...@192.168.3.8:<5881.19937.3636>:stats_collector:126: Dropped 10 ticks
INFO REPORT <5881.32276.3759> 2012-10-01 07:15:21
===============================================================================
ns_log: logging ns_node_disco:5:Node 'ns...@192.168.3.8' saw that node 'ns...@192.168.3.7' went down.
INFO REPORT <5881.338.0> 2012-10-01 07:15:21
===============================================================================
ns_node_disco_log: nodes changed: ['ns...@192.168.3.7',
'ns...@192.168.3.8']
ERROR REPORT <5881.1450.1906> 2012-10-01 07:15:21
===============================================================================
ns...@192.168.3.8:<5881.1450.1906>:stats_collector:126: Dropped 10 ticks
ERROR REPORT <5881.3297.2484> 2012-10-01 07:15:21
===============================================================================
ns...@192.168.3.8:<5881.3297.2484>:stats_collector:126: Dropped 10 ticks
ERROR REPORT <5881.5199.1906> 2012-10-01 07:15:21
===============================================================================
ns...@192.168.3.8:<5881.5199.1906>:stats_collector:126: Dropped 10 ticks
INFO REPORT <5881.32276.3759> 2012-10-01 07:15:21
===============================================================================
ns_log: logging ns_node_disco:4:Node 'ns...@192.168.3.8' saw that node 'ns...@192.168.3.6' came up.

On Friday, October 5, 2012 at 10:42:37 AM UTC+4, Frank wrote:
Re: Membase 1.7 data loss kay ru 10/5/12 1:28 AM
I have found a very strange log entry. I don't know if it is related to my problem.

In the past I had old membase servers that were in the cluster; they were later reinstalled as KVM host nodes with a new OS. For some reason I found these old nodes in the logs among the current nodes. I didn't restart the membase cluster, but I removed the nodes from the configuration via the web console. How could the old membase nodes still appear in the current logs?

I can restart the cluster now to see whether the bug disappears, but I guess the developers may need some information on this bug, so I'll wait a while.

On Friday, October 5, 2012 at 11:30:39 AM UTC+4, kay kay wrote:
Re: Membase 1.7 data loss kay ru 10/10/12 7:24 AM
I still have problems with the membase cluster.

I wrote a script which executes get requests for 1000 items.

Sometimes it gets back only 991, 999, or 936 of them. After a few seconds the count goes back to 1000. The membase logs don't show any errors.

Can anybody help me?
Re: Membase 1.7 data loss Chad Kouse 10/10/12 8:21 AM
How many nodes in your cluster?  Is this a membase bucket type? Are you connecting via moxi?  The counts you're seeing, are they in the web console? Are you doing 1000 gets or 1 multiget?  What client are you using?

--
Chad Kouse

Re: Membase 1.7 data loss kay ru 10/10/12 10:04 PM
I have three nodes.
It is a memcached bucket (it is faster than a membase bucket).
I connect via moxi.
I have attached the python script to answer the rest of the questions.

Membase server - 1.7.2
Moxi server - 1.7.1

I appreciate your help.

On Wednesday, October 10, 2012 at 7:21:45 PM UTC+4, chadkouse wrote:
Re: Membase 1.7 data loss kay ru 10/11/12 3:44 AM
Detailed logs showed me that the errors occurred when the moxi stats reported "tot_downstream_timeout". I also noticed packet drops on the ethernet interface.
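
To keep an eye on that counter I poll moxi's stats over the plain memcached protocol (if I remember correctly moxi answers a "stats proxy" command on its listen port; host and port below are placeholders):

# Sketch: poll moxi's proxy stats and print the downstream timeout counters.
# Host/port are placeholders for the local moxi instance.
import socket

def moxi_proxy_stats(host="127.0.0.1", port=11220):
    sock = socket.create_connection((host, port), timeout=2)
    sock.sendall(b"stats proxy\r\n")
    data = b""
    while not data.endswith(b"END\r\n"):
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    sock.close()
    stats = {}
    for line in data.decode().splitlines():
        if line.startswith("STAT "):
            _, name, value = line.split(" ", 2)
            stats[name] = value
    return stats

if __name__ == "__main__":
    for name, value in moxi_proxy_stats().items():
        if "downstream_timeout" in name:
            print("%s %s" % (name, value))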

Now I'll try to figure out what is wrong.

Thanks for the help.

Re: Membase 1.7 data loss Frank 10/11/12 5:10 PM
Could you upgrade to 1.8.1 and moxi 1.8? I believe a bunch of moxi errors were fixed.
