I have a two-node ganeti-2.11 cluster where one of the drbd instances is in a bad state:
# drbd-overview
0:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r-----
1:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r-----
2:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r-----
4:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r-----
5:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r-----
6:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r-----
7:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r-----
10:??not-found?? StandAlone Primary/Unknown UpToDate/DUnknown r-----
11:??not-found?? Connected Primary/Secondary UpToDate/UpToDate C r-----
The other side shows:
...
10:??not-found?? StandAlone Secondary/Unknown UpToDate/DUnknown
...
I have found the instance which uses /dev/drbd10. gnt-instance info shows:
Disk template: drbd
Disks:
- disk/0: drbd, size 20.0G
access mode: rw
port: 11026
auth key: 416bdb66767e7664133766f7e2b41d55af1ecdb0
on primary: /dev/drbd10 (147:10) in sync, status *DEGRADED*
on secondary: /dev/drbd10 (147:10) in sync, status *DEGRADED*
name: None
UUID: be8c8774-6727-4c82-b83a-ff51e5aca425
child devices:
- child 0: plain, size 20.0G
logical_id: xenvg/f50973f4-e40f-46fb-bee8-831cecaf051c.disk0_data
on primary: /dev/xenvg/f50973f4-e40f-46fb-bee8-831cecaf051c.disk0_data (253:25)
on secondary: /dev/xenvg/f50973f4-e40f-46fb-bee8-831cecaf051c.disk0_data (253:23)
name: None
UUID: 53b1b9f3-5d33-449f-b6c1-59a95d861379
- child 1: plain, size 128M
logical_id: xenvg/f50973f4-e40f-46fb-bee8-831cecaf051c.disk0_meta
on primary: /dev/xenvg/f50973f4-e40f-46fb-bee8-831cecaf051c.disk0_meta (253:26)
on secondary: /dev/xenvg/f50973f4-e40f-46fb-bee8-831cecaf051c.disk0_meta (253:24)
name: None
UUID: 02330058-838a-4ae3-ba6f-6085aa00bdf5
(Aside: I don't understand how it can be "in sync" and yet "degraded" at the same time...)
dmesg | grep drbd10 shows:
[9623265.691560] block drbd10: conn( StandAlone -> Unconnected )
[9623265.697449] block drbd10: Starting receiver thread (from drbd10_worker [4780])
[9623265.705032] block drbd10: receiver (re)started
[9623265.709782] block drbd10: conn( Unconnected -> WFConnection )
[9623266.214534] block drbd10: Handshake successful: Agreed network protocol version 96
[9623266.229363] block drbd10: Peer authenticated using 16 bytes of 'md5' HMAC
[9623266.236380] block drbd10: conn( WFConnection -> WFReportParams )
[9623266.242686] block drbd10: Starting asender thread (from drbd10_receiver [20302])
[9623266.250333] block drbd10: data-integrity-alg: <not-used>
[9623266.255853] block drbd10: drbd_sync_handshake:
[9623266.260498] block drbd10: self B5BE145BFFB86D7B:D66CEA14B365CF09:AA37400902A6071A:AA36400902A6071B bits:40099 flags:0
[9623266.271347] block drbd10: peer 6CD10A4C30600E74:D66CEA14B365CF09:AA37400902A6071B:AA36400902A6071B bits:11 flags:0
[9623266.281898] block drbd10: uuid_compare()=100 by rule 90
[9623266.287320] block drbd10: helper command: /bin/true initial-split-brain minor-10
[9623266.295266] block drbd10: helper command: /bin/true initial-split-brain minor-10 exit code 0 (0x0)
[9623266.304451] block drbd10: Split-Brain detected but unresolved, dropping connection!
[9623266.312319] block drbd10: helper command: /bin/true split-brain minor-10
[9623266.319519] block drbd10: helper command: /bin/true split-brain minor-10 exit code 0 (0x0)
[9623266.328089] block drbd10: conn( WFReportParams -> Disconnecting )
[9623266.334564] block drbd10: error receiving ReportState, l: 4!
[9623266.340521] block drbd10: asender terminated
[9623266.344991] block drbd10: Terminating drbd10_asender
[9623266.345037] block drbd10: Connection closed
[9623266.345043] block drbd10: conn( Disconnecting -> StandAlone )
[9623266.345055] block drbd10: receiver terminated
[9623266.345056] block drbd10: Terminating drbd10_receiver
OK, so it's a split brain. The instance is running on node1, so I just want to force a resync so that node2 ends up with an identical copy (i.e. sync from the primary to the secondary). But how do I do this on a two-node cluster?
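For context, the generic DRBD split-brain recovery (per the DRBD 8.x user's guide) would look something like the sketch below, discarding the secondary's data. I'm not sure how well this applies here, since Ganeti manages its DRBD devices itself and doesn't keep resource definitions in /etc/drbd.conf, so drbdadm may not know the resource by name and drbdsetup against the minor might be needed instead:

```shell
# On node2 (the victim whose data we discard) -- "r10" is a
# hypothetical resource name; Ganeti doesn't define one in drbd.conf:
drbdadm secondary r10
drbdadm connect --discard-my-data r10

# On node1 (the survivor), only if it is also StandAlone:
drbdadm connect r10
```

After this, DRBD should reconnect and resync the discarded side from the survivor. But doing this behind Ganeti's back makes me nervous, which is why I'd prefer a gnt-* level solution.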
I could convert the instance to plain and then back to drbd, but that would involve shutting it down twice for an extended time, which I'd rather avoid.
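For the record, the convert-and-back approach I'd like to avoid would be roughly this (assuming "instance-name" and "node2" stand in for the real instance and secondary node; each modify requires the instance to be down):

```shell
# Drop the DRBD mirror, keeping only the primary's plain LVM volume:
gnt-instance shutdown instance-name
gnt-instance modify -t plain instance-name
gnt-instance startup instance-name

# Later: recreate the mirror, which does a full resync to node2:
gnt-instance shutdown instance-name
gnt-instance modify -t drbd -n node2 instance-name
gnt-instance startup instance-name
```

That's two downtime windows plus a full 20G resync, hence my hope for something lighter-weight.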
I tried using gnt-instance replace-disks -s, but that didn't work:
root@wrn-vm1:~# gnt-instance replace-disks -s instance-name
Mon Nov 7 17:53:51 2016 STEP 1/6 Check device existence
Mon Nov 7 17:53:53 2016 - INFO: Checking volume groups
Mon Nov 7 17:53:54 2016 STEP 2/6 Check peer consistency
Failure: command execution error:
There doesn't seem to be an option to force it. Using "-a" instead of "-s", it reports that there's no problem:
root@wrn-vm1:~# gnt-instance replace-disks -a instance-name
Any other clues for how to proceed?
Thanks,
Brian.