I haven't followed up on my previous post regarding the restore of backups. Last night our membase nodes filled their disks (bad backup job from our side) and this caused membase to 'crash'. We had to restore from backups and we were able to do this. We didn't restore the configuration, but just created a new cluster, copied the backups there and fired it up.
Some observations:
- This morning, we noticed that the disk write queue had 1M items in it and that nothing was being persisted to disk. The logs showed a "SQL logic error or missing database" error.
- Membase was still serving from RAM, but all the data files on disk were gone (?!)
- Since at that moment there were about 1.5M items in the disk write queue, we tried to use TAP to stream them out of the broken cluster into the new one. Unfortunately, the backfill date only seems to work as a boolean (everything from the beginning, or only new changes).
- After the warm-up phase, with no production load on the new cluster, the disk write queue went up to 1M items. It took some time for it to settle down. What happens in this phase? During this phase, membase flushed to disk at a max rate of 20k items/s.
- Yesterday, we noticed that the disk write queue grew to around 100k items and membase (under production load of around 1.5k ops/s) only flushed to disk at a rate of at most 1k items/s. That rate of 1k items/s has held since, and the disk write queue size is around 10k. Is there a way to increase the rate of persisting to disk?
Greetings,
Wouter
On Wednesday, 27 April 2011 at 19:04, Perry Krug wrote:
Hey Wouter, glad you guys are back up and running, sorry you ran into some issues.
I haven't followed up on my previous post regarding the restore of backups. Last night our membase nodes filled their disks (bad backup job from our side) and this caused membase to 'crash'. We had to restore from backups and we were able to do this. We didn't restore the configuration, but just created a new cluster, copied the backups there and fired it up.
[pk] - Unfortunately this will likely not work. There needs to be cohesion between the vbuckets stored on disk and the vbuckets that the servers are expecting to handle. We're actively working on making this easier, but with the current code base, you need to restore the configuration along with the data files. You also need to ensure that the IP/DNS names of the servers are the same from one cluster to another, so that they match what the configuration has stored. Again, we're making it easier to deal with this.
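The point about restoring the configuration together with the data files can be sketched as a small restore helper. The default paths below are assumptions for a stock /opt/membase install (they are not confirmed in this thread), and the helper itself is only an illustration of "both pieces must travel together":

```python
import shutil
from pathlib import Path

# ASSUMED default locations for a stock /opt/membase install -- adjust to
# your actual layout before using anything like this.
DATA_DIR = Path("/opt/membase/var/lib/membase/data")
CONFIG_DIR = Path("/opt/membase/var/lib/membase/config")

def restore_node(backup_root, data_dir=DATA_DIR, config_dir=CONFIG_DIR):
    """Restore BOTH the data files and the saved configuration.

    Restoring only the data files leaves the cluster expecting a vbucket
    layout that the files on disk no longer match. The node must also keep
    the same IP/DNS name that the saved configuration refers to.
    """
    backup_root = Path(backup_root)
    for src, dst in ((backup_root / "data", data_dir),
                     (backup_root / "config", config_dir)):
        shutil.copytree(src, dst, dirs_exist_ok=True)
    return data_dir, config_dir
```

The helper would be run with the service stopped, once per node, against a backup tree that contains both a `data/` and a `config/` subdirectory.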
Some observations:
- This morning, we noticed that the disk write queue had 1M items in it and that nothing was being persisted to disk. The logs showed a "SQL logic error or missing database" error.
[pk] - This is usually related to permissions issues on the data directory...it could have been just one node. Our 1.7 release has implemented per-server monitoring so you can more easily see if there is just one offending node.
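Following the permissions theory above, the quickest thing to rule out on each node is the data directory itself. A minimal check, assuming you run it as the same user the membase server runs as:

```python
import os
from pathlib import Path

def check_data_dir(path):
    """Return a short diagnosis of why persistence to `path` might fail.

    A "SQL logic error or missing database" in the logs can simply mean the
    server could not create or write its database files, so the directory
    is the first thing to inspect (run this as the server's own user).
    """
    p = Path(path)
    if not p.is_dir():
        return "MISSING"
    # W_OK for writing files, X_OK for traversing/creating entries
    if not os.access(p, os.W_OK | os.X_OK):
        return "NOT WRITABLE"
    return "OK"
```

Anything other than "OK" on one node, while the others pass, would fit the single-offending-node pattern described above.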
- Membase was still serving from RAM, but all the data files on disk were gone (?!)
[pk] - Missing data files would definitely be a bad thing, but we would recreate them if possible (hence the permissions theory). It may be too late to diagnose this now...
- Since at that moment there were about 1.5M items in the disk write queue, we tried to use TAP to stream them out of the broken cluster into the new one. Unfortunately, the backfill date only seems to work as a boolean (everything from the beginning, or only new changes).
[pk] - Yes, that is a current limitation of TAP, though we are changing this for 1.7 as well. I don't have the docs in front of me, but we've implemented a concept of "TAP checkpoints" which will allow you to stream data from a point in time rather than always from the beginning.
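For reference, the backfill field discussed above is a 64-bit timestamp on the wire in the TAP_CONNECT request of the memcached binary protocol, even though the 1.6-era server effectively honours only the two extremes. A sketch of building such a request (the opcode and flag constants are from the binary protocol; treat the exact value layout as an assumption to verify against your server version):

```python
import struct

# Constants from the memcached binary protocol / TAP interface.
MAGIC_REQUEST = 0x80
CMD_TAP_CONNECT = 0x40
TAP_FLAG_BACKFILL = 0x01

def tap_connect(name, backfill_ts):
    """Build a TAP_CONNECT request asking for backfill from `backfill_ts`.

    The wire format carries a 64-bit backfill date, but as observed in this
    thread the server behaves as if it were a boolean: 0 streams everything
    from the beginning, 0xFFFFFFFF streams only new changes.
    """
    key = name.encode()                             # TAP client name
    extras = struct.pack(">I", TAP_FLAG_BACKFILL)   # 4-byte flags
    value = struct.pack(">Q", backfill_ts)          # 8-byte backfill date
    body = extras + key + value
    header = struct.pack(">BBHBBHIIQ",
                         MAGIC_REQUEST, CMD_TAP_CONNECT,
                         len(key),        # key length
                         len(extras),     # extras length
                         0, 0,            # data type, vbucket (unused here)
                         len(body),       # total body length
                         0, 0)            # opaque, cas
    return header + body
```

Sending this over a socket to the memcached port of a node would open the TAP stream; the "TAP checkpoints" mentioned for 1.7 are what would make intermediate values of `backfill_ts` meaningful.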
- After the warm-up phase, with no production load on the new cluster, the disk write queue went up to 1M items. It took some time for it to settle down. What happens in this phase? During this phase, membase flushed to disk at a max rate of 20k items/s.
[pk] - This is the replication rematerializing after warmup. I believe 1.7 will be improving this as well, but I'll need to check on the specific improvements.
- Yesterday, we noticed that the disk write queue grew to around 100k items and membase (under production load of around 1.5k ops/s) only flushed to disk at a rate of at most 1k items/s. That rate of 1k items/s has held since, and the disk write queue size is around 10k. Is there a way to increase the rate of persisting to disk?
[pk] - Disk persistence speed is pretty variable based on the underlying disk speed. The reported speed can also be misleading sometimes, since it aggregates the speed from all nodes: when all are writing to disk, the reported speed will be very high; if only one node is writing, it will be reported as much lower. It would be worth analyzing whether just one node has a high write queue, or whether they all do and are all slow to drain. We also know that updates to existing data are considerably slower than inserts of new data.
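To separate one slow-draining node from an aggregate average, one could snapshot the text-protocol `stats` output per node twice and compute each node's own drain rate. A sketch, assuming the ep-engine stat name `ep_total_persisted` (a cumulative count of items persisted; verify the name against your server version):

```python
def parse_stats(text):
    """Parse a text-protocol `stats` response ("STAT name value" lines)."""
    out = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            out[parts[1]] = parts[2]
    return out

def drain_rate(before, after, interval_s):
    """Per-node persisted items/s between two `stats` snapshots.

    `before` and `after` map node name -> raw stats text, captured
    `interval_s` seconds apart on each node's memcached port.
    """
    return {node: (int(parse_stats(after[node])["ep_total_persisted"])
                   - int(parse_stats(before[node])["ep_total_persisted"]))
                  / interval_s
            for node in before}
```

If one node shows a much lower rate (and a much larger `ep_queue_size`) than the rest, that node's disk or permissions are the place to look, rather than a cluster-wide tuning knob.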
Perry Krug
Solutions Architect
direct: 831-824-4123
email: pe...@couchbase.com