I haven't followed up on my previous post regarding the restore of backups. Last night our membase nodes filled their disks (bad backup job from our side) and this caused membase to 'crash'. We had to restore from backups and we were able to do this. We didn't restore the configuration, but just created a new cluster, copied the backups there and fired it up.
Some observations:
- This morning, we noticed that the disk write queue had 1M items in it and that nothing was being persisted to disk. The logs showed a "SQL logic error or missing database" error.
- Membase was still serving from RAM, but all the data files on disk were gone (?!)
- Since at that moment there were about 1.5M items in the disk write queue, we tried to use TAP to stream them out of the broken cluster into the new one. Unfortunately, the backfill date only seems to work as a boolean (everything from the beginning, or only new changes).
- After the warm-up phase, with no production load on the new cluster, the disk write queue went up to 1M items. It took some time for it to settle down. What happens in this phase? During this phase, membase flushed to disk at a max rate of 20k items/s.
- Yesterday, we noticed that the disk write queue grew to around 100k items and membase (under production load of around 1.5k ops/s) only flushed to disk at a rate of at most 1k items/s. That rate of 1k items/s has held since, and the disk write queue size is around 10k. Is there a way to increase the rate of persisting to disk?
Greetings,
Wouter
On Wednesday, 27 April 2011 at 19:04, Perry Krug wrote:
Hey Wouter, glad you guys are back up and running, sorry you ran into some issues.
I haven't followed up on my previous post regarding the restore of backups. Last night our membase nodes filled their disks (bad backup job from our side) and this caused membase to 'crash'. We had to restore from backups and we were able to do this. We didn't restore the configuration, but just created a new cluster, copied the backups there and fired it up.
[pk] - Unfortunately this will likely not work. There needs to be cohesion between the vbuckets stored on disk and the vbuckets that the servers are expecting to handle. We're actively working on making this easier, but with the current code base, you need to restore the configuration along with the data files. You also need to ensure that the IP/DNS names of the servers are the same from one cluster to another, so that they match what the configuration has stored. Again, we're making it easier to deal with this.
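The point about restoring the configuration together with the data files can be sketched as a small restore helper. The default paths below are assumptions for a stock /opt/membase install (they are not confirmed in this thread), and the helper itself is only an illustration of "both pieces must travel together":

```python
import shutil
from pathlib import Path

# ASSUMED default locations for a stock /opt/membase install -- adjust to
# your actual layout before using anything like this.
DATA_DIR = Path("/opt/membase/var/lib/membase/data")
CONFIG_DIR = Path("/opt/membase/var/lib/membase/config")

def restore_node(backup_root, data_dir=DATA_DIR, config_dir=CONFIG_DIR):
    """Restore BOTH the data files and the saved configuration.

    Restoring only the data files leaves the cluster expecting a vbucket
    layout that the files on disk no longer match. The node must also keep
    the same IP/DNS name that the saved configuration refers to.
    """
    backup_root = Path(backup_root)
    for src, dst in ((backup_root / "data", data_dir),
                     (backup_root / "config", config_dir)):
        shutil.copytree(src, dst, dirs_exist_ok=True)
    return data_dir, config_dir
```

The helper would be run with the service stopped, once per node, against a backup tree that contains both a `data/` and a `config/` subdirectory.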
Some observations:
- This morning, we noticed that the disk write queue had 1M items in it and that nothing was being persisted to disk. The logs showed a "SQL logic error or missing database" error.
[pk] - This is usually related to permissions issues on the data directory...it could have been just one node. Our 1.7 release has implemented per-server monitoring so you can more easily see if there is just one offending node.
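Following the permissions theory above, the quickest thing to rule out on each node is the data directory itself. A minimal check, assuming you run it as the same user the membase server runs as:

```python
import os
from pathlib import Path

def check_data_dir(path):
    """Return a short diagnosis of why persistence to `path` might fail.

    A "SQL logic error or missing database" in the logs can simply mean the
    server could not create or write its database files, so the directory
    is the first thing to inspect (run this as the server's own user).
    """
    p = Path(path)
    if not p.is_dir():
        return "MISSING"
    # W_OK for writing files, X_OK for traversing/creating entries
    if not os.access(p, os.W_OK | os.X_OK):
        return "NOT WRITABLE"
    return "OK"
```

Anything other than "OK" on one node, while the others pass, would fit the single-offending-node pattern described above.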
- Membase was still serving from RAM, but all the data files on disk were gone (?!)
[pk] - Missing data files would definitely be a bad thing, but we would recreate them if possible (hence the permissions theory). It may be too late to diagnose this now...
- Since at that moment there were about 1.5M items in the disk write queue, we tried to use TAP to stream them out of the broken cluster into the new one. Unfortunately, the backfill date only seems to work as a boolean (everything from the beginning, or only new changes).
[pk] - Yes, that is a current limitation of TAP, though we are changing this for 1.7 as well. I don't have the docs in front of me, but we've implemented a concept of "TAP checkpoints" which will allow you to stream data from a point in time rather than always from the beginning.
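For reference, the backfill field discussed above is a 64-bit timestamp on the wire in the TAP_CONNECT request of the memcached binary protocol, even though the 1.6-era server effectively honours only the two extremes. A sketch of building such a request (the opcode and flag constants are from the binary protocol; treat the exact value layout as an assumption to verify against your server version):

```python
import struct

# Constants from the memcached binary protocol / TAP interface.
MAGIC_REQUEST = 0x80
CMD_TAP_CONNECT = 0x40
TAP_FLAG_BACKFILL = 0x01

def tap_connect(name, backfill_ts):
    """Build a TAP_CONNECT request asking for backfill from `backfill_ts`.

    The wire format carries a 64-bit backfill date, but as observed in this
    thread the server behaves as if it were a boolean: 0 streams everything
    from the beginning, 0xFFFFFFFF streams only new changes.
    """
    key = name.encode()                             # TAP client name
    extras = struct.pack(">I", TAP_FLAG_BACKFILL)   # 4-byte flags
    value = struct.pack(">Q", backfill_ts)          # 8-byte backfill date
    body = extras + key + value
    header = struct.pack(">BBHBBHIIQ",
                         MAGIC_REQUEST, CMD_TAP_CONNECT,
                         len(key),        # key length
                         len(extras),     # extras length
                         0, 0,            # data type, vbucket (unused here)
                         len(body),       # total body length
                         0, 0)            # opaque, cas
    return header + body
```

Sending this over a socket to the memcached port of a node would open the TAP stream; the "TAP checkpoints" mentioned for 1.7 are what would make intermediate values of `backfill_ts` meaningful.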
- After the warm-up phase, with no production load on the new cluster, the disk write queue went up to 1M items. It took some time for it to settle down. What happens in this phase? During this phase, membase flushed to disk at a max rate of 20k items/s.
[pk] - This is the replication rematerializing after warmup. I believe 1.7 will be improving this as well, but I'll need to check on the specific improvements.
- Yesterday, we noticed that the disk write queue grew to around 100k items and membase (under production load of around 1.5k ops/s) only flushed to disk at a rate of at most 1k items/s. That rate of 1k items/s has held since, and the disk write queue size is around 10k. Is there a way to increase the rate of persisting to disk?
[pk] - Disk persistence speed is pretty variable based on the underlying disk speed. The reported speed can also be misleading sometimes, since it aggregates the speed from all nodes: when all are writing to disk, the reported speed will be very high; if only one node is writing, it will be reported as much lower. It would be worth analyzing whether just one node has a high write queue, or whether they all do and are all slow to drain. We also know that updates to existing data are considerably slower than inserts of new data.
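To separate one slow-draining node from an aggregate average, one could snapshot the text-protocol `stats` output per node twice and compute each node's own drain rate. A sketch, assuming the ep-engine stat name `ep_total_persisted` (a cumulative count of items persisted; verify the name against your server version):

```python
def parse_stats(text):
    """Parse a text-protocol `stats` response ("STAT name value" lines)."""
    out = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            out[parts[1]] = parts[2]
    return out

def drain_rate(before, after, interval_s):
    """Per-node persisted items/s between two `stats` snapshots.

    `before` and `after` map node name -> raw stats text, captured
    `interval_s` seconds apart on each node's memcached port.
    """
    return {node: (int(parse_stats(after[node])["ep_total_persisted"])
                   - int(parse_stats(before[node])["ep_total_persisted"]))
                  / interval_s
            for node in before}
```

If one node shows a much lower rate (and a much larger `ep_queue_size`) than the rest, that node's disk or permissions are the place to look, rather than a cluster-wide tuning knob.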
Perry Krug
Solutions Architect
direct: 831-824-4123
email: pe...@couchbase.com