Hi,
while moving / replacing the disks of one VM on my cluster I got the error
"Can't change network configuration: drbd7: timeout while configuring
network (please do a gnt-instance info to see the status of disks)"
=> disk/0 has failed, but disk/1 is working fine
So I tried to repair it with "gnt-instance replace-disks -a lug-in", but
that gives me a "No disks need replacement for instance
'lug-in.xxxxxxxxx.de'"
=> disk/0 is still degraded
I ran a "gnt-cluster verify", which doesn't detect anything (except that
the LVM volumes are still on node2, but they are reported as unknown)
I then tried a "gnt-instance replace-disks -s lug-in", which fails again
with the same error as the original "replace-disks -n node1"
Does anyone have an idea what I have forgotten or what else I could try?
Notice: all other (17) VMs are running without problems
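For reference, this is a quick sketch of how I check which DRBD minors are stuck in StandAlone (it just greps the DRBD 8.3 /proc/drbd format quoted below; the helper name is only illustrative):

```shell
# List DRBD minor numbers whose connection state is StandAlone.
# Reads /proc/drbd by default; pass a file path to test against a sample.
drbd_standalone() {
  awk '/cs:StandAlone/ { sub(/:$/, "", $1); print $1 }' "${1:-/proc/drbd}"
}
```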
Greetings
Neal
> root@node2 ~ # gnt-job info 118433
> Job ID: 118433
> Status: error
> Received: 2014-05-23 02:59:34.007653
> Processing start: 2014-05-23 02:59:34.150118 (delta 0.142465s)
> Processing end: 2014-05-23 03:02:11.719716 (delta 157.569598s)
> Total processing time: 157.712063 seconds
> Opcodes:
> OP_INSTANCE_REPLACE_DISKS
> Status: error
> Processing start: 2014-05-23 02:59:34.150118
> Execution start: 2014-05-23 02:59:36.276322
> Processing end: 2014-05-23 03:02:11.719693
> Input fields:
> comment: None
> debug_level: 0
> depends: None
> disks:
> dry_run: False
> early_release: False
> ignore_ipolicy: False
> instance_name: lug-in.xxxxxxxxx.de
> instance_uuid: a7c2d2f2-a248-4191-b75f-1a2ce53bba9a
> mode: replace_new_secondary
> priority: 0
> reason: ['gnt:client:gnt-instance', 'replace-disks',
> 1400806773786305024],['gnt:opcode:op_instance_replace_disks',
> 'job=118433;index=0', 1400806774007633920]
> remote_node: node1.xxxxxxxxx.de
> remote_node_uuid: 56a75bd8-f179-40c6-a8e4-3adc283968f3
> Result:
> OpExecError
> [Can't attach drbd disks on node node1.xxxxxxxxx.de: Can't
> change network configuration: drbd7: timeout while configuring network
> (please do a gnt-instance info to see the status of disks)]
> Execution log:
> 1:2014-05-23 02:59:36.596214:message Replacing disk(s) 0, 1
> for instance 'lug-in.xxxxxxxxx.de'
> 2:2014-05-23 02:59:36.655885:message Current primary node:
> node3.xxxxxxxxx.de
> 3:2014-05-23 02:59:36.706412:message Current seconary node:
> node2.xxxxxxxxx.de
> 4:2014-05-23 02:59:36.748099:message STEP 1/6 Check device
> existence
> 5:2014-05-23 02:59:36.791096:message - INFO: Checking disk/0
> on node3.xxxxxxxxx.de
> 6:2014-05-23 02:59:37.535986:message - INFO: Checking disk/1
> on node3.xxxxxxxxx.de
> 7:2014-05-23 02:59:37.743781:message - INFO: Checking volume
> groups
> 8:2014-05-23 02:59:37.878451:message STEP 2/6 Check peer
> consistency
> 9:2014-05-23 02:59:37.901140:message - INFO: Checking disk/0
> consistency on node node3.xxxxxxxxx.de
> 10:2014-05-23 02:59:38.336108:message - INFO: Checking disk/1
> consistency on node node3.xxxxxxxxx.de
> 11:2014-05-23 02:59:38.778363:message STEP 3/6 Allocate new
> storage
> 12:2014-05-23 02:59:38.827892:message - INFO: Adding new
> local storage on node1.xxxxxxxxx.de for disk/0
> 13:2014-05-23 02:59:43.446056:message - INFO: Adding new
> local storage on node1.xxxxxxxxx.de for disk/1
> 14:2014-05-23 02:59:50.388658:message STEP 4/6 Changing drbd
> configuration
> 15:2014-05-23 02:59:50.438092:message - INFO: activating a
> new drbd on node1.xxxxxxxxx.de for disk/0
> 16:2014-05-23 02:59:59.627383:message - INFO: activating a
> new drbd on node1.xxxxxxxxx.de for disk/1
> 17:2014-05-23 03:00:06.809704:message - INFO: Shutting down
> drbd for disk/0 on old node
> 18:2014-05-23 03:00:08.667964:message - INFO: Shutting down
> drbd for disk/1 on old node
> 19:2014-05-23 03:00:09.100976:message - INFO: Detaching
> primary drbds from the network (=> standalone)
> 20:2014-05-23 03:00:09.581068:message - INFO: Updating
> instance configuration
> 21:2014-05-23 03:00:09.810649:message - INFO: Attaching
> primary drbds to new secondary (standalone => connected)
> - disk/0: drbd, size 32.0G
> access mode: rw
> nodeA: node3.xxxxxxxxx.de, minor=15
> nodeB: node1.xxxxxxxxx.de, minor=7
> port: 11021
> auth key: 8206397b5460285e6c06cf2ce1be6c8d5f8e03d3
> on primary: /dev/drbd15 (147:15) in sync, status *DEGRADED*
> on secondary: /dev/drbd7 (147:7) in sync, status *DEGRADED*
> *UNCERTAIN STATE*
> name: None
> UUID: 51f4d88f-b228-4b74-ba92-ae9fb7e6862e
> child devices:
> - child 0: plain, size 32.0G
> logical_id: lvm/843d313c-0ca3-4fe6-9a8b-7923302d711b.disk0_data
> on primary:
> /dev/lvm/843d313c-0ca3-4fe6-9a8b-7923302d711b.disk0_data (253:38)
> on secondary:
> /dev/lvm/843d313c-0ca3-4fe6-9a8b-7923302d711b.disk0_data (253:24)
> name: None
> UUID: e163561b-4b39-45a3-87e6-c507d26fc622
> - child 1: plain, size 128M
> logical_id: lvm/843d313c-0ca3-4fe6-9a8b-7923302d711b.disk0_meta
> on primary:
> /dev/lvm/843d313c-0ca3-4fe6-9a8b-7923302d711b.disk0_meta (253:39)
> on secondary:
> /dev/lvm/843d313c-0ca3-4fe6-9a8b-7923302d711b.disk0_meta (253:25)
> name: None
> UUID: 3e0180bb-e0bc-48ce-b52b-50ee6635ea7b
> - disk/1: drbd, size 128.0G
> access mode: rw
> nodeA: node3.xxxxxxxxx.de, minor=16
> nodeB: node1.xxxxxxxxx.de, minor=8
> port: 11025
> auth key: ff7e90ea4f2dc4d7ebaf2c50fe4bdfdc4369d389
> on primary: /dev/drbd16 (147:16) in sync, status ok
> on secondary: /dev/drbd8 (147:8) in sync, status ok
> name: None
> UUID: 53d76932-8a06-4137-8636-791b001bcb2d
> child devices:
> - child 0: plain, size 128.0G
> logical_id: lvm/7978a477-4f0f-43b8-be39-f87ece633c22.disk1_data
> on primary:
> /dev/lvm/7978a477-4f0f-43b8-be39-f87ece633c22.disk1_data (253:40)
> on secondary:
> /dev/lvm/7978a477-4f0f-43b8-be39-f87ece633c22.disk1_data (253:26)
> name: None
> UUID: e1e4d4ea-20e6-48ac-9cfb-dccb14c98878
> - child 1: plain, size 128M
> logical_id: lvm/7978a477-4f0f-43b8-be39-f87ece633c22.disk1_meta
> on primary:
> /dev/lvm/7978a477-4f0f-43b8-be39-f87ece633c22.disk1_meta (253:41)
> on secondary:
> /dev/lvm/7978a477-4f0f-43b8-be39-f87ece633c22.disk1_meta (253:28)
> name: None
> UUID: 51553d17-ba4f-4ae8-bdfa-8de9dbbd8452
> root@node2 ~ # gnt-instance replace-disks -a lug-in
> Job 118452 is trying to acquire all necessary locks
> Fri May 23 04:54:41 2014 - INFO: Checking disk/0 on node3.xxxxxxxxx.de
> Fri May 23 04:54:41 2014 - INFO: Checking disk/0 on node1.xxxxxxxxx.de
> Fri May 23 04:54:41 2014 - INFO: Checking disk/1 on node3.xxxxxxxxx.de
> Fri May 23 04:54:42 2014 - INFO: Checking disk/1 on node1.xxxxxxxxx.de
> Fri May 23 04:54:44 2014 No disks need replacement for instance
> 'lug-in.xxxxxxxxx.de'
> root@node2 ~ # gnt-cluster verify
> Submitted jobs 118580, 118581
> Waiting for job 118580 ...
> Fri May 23 10:28:43 2014 * Verifying cluster config
> Fri May 23 10:28:43 2014 * Verifying cluster certificate files
> Fri May 23 10:28:43 2014 * Verifying hypervisor parameters
> Fri May 23 10:28:43 2014 * Verifying all nodes belong to an existing group
> Waiting for job 118581 ...
> Fri May 23 10:28:44 2014 * Verifying group 'default'
> Fri May 23 10:28:44 2014 * Gathering data (3 nodes)
> Fri May 23 10:28:48 2014 * Gathering disk information (3 nodes)
> Fri May 23 10:28:58 2014 * Verifying configuration file consistency
> Fri May 23 10:28:58 2014 * Verifying node status
> Fri May 23 10:28:58 2014 * Verifying instance status
> Fri May 23 10:28:58 2014 * Verifying orphan volumes
> Fri May 23 10:28:58 2014 - ERROR: node node2.xxxxxxxxx.de: volume
> lvm/7978a477-4f0f-43b8-be39-f87ece633c22.disk1_data is unknown
> Fri May 23 10:28:58 2014 - ERROR: node node2.xxxxxxxxx.de: volume
> lvm/843d313c-0ca3-4fe6-9a8b-7923302d711b.disk0_meta is unknown
> Fri May 23 10:28:58 2014 - ERROR: node node2.xxxxxxxxx.de: volume
> lvm/843d313c-0ca3-4fe6-9a8b-7923302d711b.disk0_data is unknown
> Fri May 23 10:28:58 2014 - ERROR: node node2.xxxxxxxxx.de: volume
> lvm/7978a477-4f0f-43b8-be39-f87ece633c22.disk1_meta is unknown
> Fri May 23 10:28:58 2014 * Verifying N+1 Memory redundancy
> Fri May 23 10:28:58 2014 * Other Notes
> Fri May 23 10:28:59 2014 - NOTICE: 1 non-redundant instance(s) found.
> Fri May 23 10:28:59 2014 - NOTICE: 3 non-auto-balanced instance(s)
> found.
> Fri May 23 10:28:59 2014 * Hooks Results
> root@node2 ~ # ssh node1 cat /proc/drbd
> version: 8.3.11 (api:88/proto:86-96)
> srcversion: F937DCB2E5D83C6CCE4A6C9
> [...]
> 7: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown r-----
> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f
> oos:33554176
> root@node2 ~ # ssh node3 cat /proc/drbd
> version: 8.3.11 (api:88/proto:86-96)
> srcversion: F937DCB2E5D83C6CCE4A6C9
> [...]
> 15: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
> ns:0 nr:0 dw:151228 dr:532788 al:217 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1
> wo:f oos:73096
> root@node2 ~ # gnt-job info 118464
> Job ID: 118464
> Status: error
> Received: 2014-05-23 05:16:18.058327
> Processing start: 2014-05-23 05:16:18.565548 (delta 0.507221s)
> Processing end: 2014-05-23 06:02:05.403075 (delta 2746.837527s)
> Total processing time: 2747.344748 seconds
> Opcodes:
> OP_INSTANCE_REPLACE_DISKS
> Status: error
> Processing start: 2014-05-23 05:16:18.565548
> Execution start: 2014-05-23 06:01:45.634275
> Processing end: 2014-05-23 06:02:05.402960
> Input fields:
> comment: None
> debug_level: 0
> depends: None
> disks:
> dry_run: False
> early_release: False
> ignore_ipolicy: False
> instance_name: lug-in.xxxxxxxxx.de
> instance_uuid: a7c2d2f2-a248-4191-b75f-1a2ce53bba9a
> mode: replace_on_secondary
> priority: 0
> reason: ['gnt:client:gnt-instance', 'replace-disks',
> 1400814977550833920],['gnt:opcode:op_instance_replace_disks',
> 'job=118464;index=0', 1400814978058324224]
> Result:
> OpPrereqError
> [Instance 'lug-in.xxxxxxxxx.de' is marked to be up, cannot
> shutdown disks, wrong_state]
> Execution log:
> 1:2014-05-23 06:01:45.937374:message Replacing disk(s) 0, 1
> for instance 'lug-in.xxxxxxxxx.de'
> 2:2014-05-23 06:01:45.990016:message Current primary node:
> node3.xxxxxxxxx.de
> 3:2014-05-23 06:01:46.073593:message Current seconary node:
> node1.xxxxxxxxx.de
> 4:2014-05-23 06:01:58.846739:message - WARNING: Could not
> prepare block device disk/0 on node node1.xxxxxxxxx.de
> (is_primary=False, pass=1): Error while assembling disk: drbd7:
> timeout while configuring network
> 5:2014-05-23 06:02:02.714689:message STEP 1/6 Check device
> existence
> 6:2014-05-23 06:02:02.797834:message - INFO: Checking disk/0
> on node3.xxxxxxxxx.de
> 7:2014-05-23 06:02:03.112653:message - INFO: Checking disk/0
> on node1.xxxxxxxxx.de
> 8:2014-05-23 06:02:03.417481:message - INFO: Checking disk/1
> on node3.xxxxxxxxx.de
> 9:2014-05-23 06:02:03.690932:message - INFO: Checking disk/1
> on node1.xxxxxxxxx.de
> 10:2014-05-23 06:02:03.949663:message - INFO: Checking volume
> groups
> 11:2014-05-23 06:02:05.040533:message STEP 2/6 Check peer
> consistency
> 12:2014-05-23 06:02:05.141663:message - INFO: Checking disk/0
> consistency on node node3.xxxxxxxxx.de