On May 10, 2013, at 1:31 AM, Thomas Thrainer <
thom...@google.com> wrote:
> Unfortunately, gnt-cluster verify-disks does not handle the StandAlone state right now (it only fixes the Unconfigured state AFAIK). Also, gnt-cluster verify won't notify you about StandAlone devices. That's two outstanding issues on my todo list...
>
> For me a gnt-instance activate-disks <instance> did the trick to reconnect the primary and secondary.
>
> If DRBD fails to reconnect the nodes (i.e. because of a split-brain or similar), you might run a gnt-instance replace-disks -n <new_node> <instance>. This will, however, change the secondary and trigger a complete resync. You can switch the secondary back afterwards, but this creates quite some traffic unfortunately.
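Since, as noted above, gnt-cluster verify won't flag StandAlone devices, you can spot them yourself by scanning /proc/drbd on each node. A minimal sketch (the $sample heredoc-style variable below is a hypothetical stand-in for real /proc/drbd contents; on a live node you'd just run `grep -c 'cs:StandAlone' /proc/drbd`):

```shell
# Hypothetical sample of /proc/drbd output; replace with the real file
# on an actual node. One resource here has lost its peer connection.
sample='
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
 1: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown   r-----
'
# Count DRBD minors stuck in StandAlone; non-zero means at least one
# resource needs reconnecting (e.g. via gnt-instance activate-disks).
printf '%s\n' "$sample" | grep -c 'cs:StandAlone'
```

A non-zero count on any node is the cue to try activate-disks (or, failing that, replace-disks) for the affected instances.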
I found a method that seems to work. I had tried this before, but it fails if the instance is running (and probably even if the disks are merely active). The error message, though, gives no hint that stopping the instance would help.
# gnt-instance replace-disks -s vm1
Fri May 10 15:41:19 2013 Replacing disk(s) 0 for instance 'vm1.local'
Fri May 10 15:41:19 2013 Current primary node: vhip2.local
Fri May 10 15:41:19 2013 Current secondary node: vhip1.local
Fri May 10 15:41:19 2013 STEP 1/6 Check device existence
Fri May 10 15:41:19 2013 - INFO: Checking disk/0 on vhip2.local
Fri May 10 15:41:19 2013 - INFO: Checking disk/0 on vhip1.local
Fri May 10 15:41:19 2013 - INFO: Checking volume groups
Fri May 10 15:41:19 2013 STEP 2/6 Check peer consistency
Fri May 10 15:41:19 2013 - INFO: Checking disk/0 consistency on node vhip2.local
Failure: command execution error:
Node vhip2.local has degraded storage, unsafe to replace disks for instance vm1.local
If I shut down the instance first, it works fine.
# gnt-instance replace-disks -s vm1
Fri May 10 15:42:51 2013 Replacing disk(s) 0 for instance 'vm1.local'
Fri May 10 15:42:51 2013 Current primary node: vhip2.local
Fri May 10 15:42:51 2013 Current secondary node: vhip1.local
Fri May 10 15:42:52 2013 STEP 1/6 Check device existence
Fri May 10 15:42:52 2013 - INFO: Checking disk/0 on vhip2.local
Fri May 10 15:42:53 2013 - INFO: Checking disk/0 on vhip1.local
Fri May 10 15:42:53 2013 - INFO: Checking volume groups
Fri May 10 15:42:53 2013 STEP 2/6 Check peer consistency
Fri May 10 15:42:53 2013 - INFO: Checking disk/0 consistency on node vhip2.local
Fri May 10 15:42:53 2013 STEP 3/6 Allocate new storage
Fri May 10 15:42:53 2013 - INFO: Adding storage on vhip1.local for disk/0
Fri May 10 15:42:54 2013 STEP 4/6 Changing drbd configuration
Fri May 10 15:42:54 2013 - INFO: Detaching disk/0 drbd from local storage
Fri May 10 15:42:54 2013 - INFO: Renaming the old LVs on the target node
Fri May 10 15:42:55 2013 - INFO: Renaming the new LVs on the target node
Fri May 10 15:42:55 2013 - INFO: Adding new mirror component on vhip1.local
Fri May 10 15:42:55 2013 STEP 5/6 Sync devices
Fri May 10 15:42:55 2013 - INFO: Waiting for instance vm1.local to sync disks
Fri May 10 15:42:56 2013 - INFO: - device disk/0: 0.60% done, 3m 28s remaining (estimated)
Fri May 10 15:43:56 2013 - INFO: Instance vm1.local's disks are in sync
Fri May 10 15:43:56 2013 STEP 6/6 Removing old storage
Fri May 10 15:43:56 2013 - INFO: Remove logical volumes for disk/0
So this is a workable solution. It would be nice if it also worked while the instance was running. The DRBD resource on the secondary is essentially trash at this point, which is why it's being replaced, so whether the instance is running doesn't seem to make the operation any more or less dangerous. Since a full resync can take hours in some cases, being able to leave the instance running would avoid a lot of unnecessary downtime during recovery.
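For reference, the whole offline recovery boils down to three commands. A sketch, wrapped in a hypothetical helper function (recover_secondary and the DRY_RUN switch are my own additions for illustration; the gnt-instance commands are the ones from the transcript above):

```shell
# Offline recovery of a trashed DRBD secondary for one instance.
# With DRY_RUN set, the commands are only printed, not executed.
recover_secondary() {
  inst="$1"
  for cmd in \
      "gnt-instance shutdown $inst" \
      "gnt-instance replace-disks -s $inst" \
      "gnt-instance startup $inst"; do
    if [ -n "$DRY_RUN" ]; then
      echo "$cmd"            # dry run: show what would be executed
    else
      $cmd || return 1       # stop at the first failing step
    fi
  done
}

DRY_RUN=1 recover_secondary vm1
```

The shutdown/startup pair is exactly the downtime window that a running-instance replace-disks would eliminate.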