Ganeti instances with degraded disks


John McNally

Mar 25, 2016, 10:48:23 AM
to ganeti
Dear Ganeti Experts,

I have a two-node v2.5.1 cluster (I know it's old; I am planning to upgrade soon) with a number of DRBD disks showing as *DEGRADED* according to "gnt-instance info". When I do a "cat /proc/drbd" on the master node, of the thirteen instances, 3 have status "Connected", 7 "WFConnection", 2 "Unconnected" and 1 "StandAlone" (see output below). I have tried "gnt-instance activate-disks <instance>" for each instance. This has resolved sync problems in the past, but not now. Would it help to simply mark the master candidate offline, then add it back to the cluster again?

Otherwise, can you offer any specific advice on getting these DRBD disks healthy again?

Much obliged.

------------

CLUSTER DETAILS
OS: CentOS 5.8
Ganeti: v2.5.1
DRBD: v8.3.13
Master node, primary for all instances: malt-a
Master candidate node, secondary for all instances: malt-b

------------

[root@malt-a ~]# gnt-instance info --all | grep -E 'Instance name:|drbd'

Instance name: foremantest2.dmz.psfc.coop

  Disk template: drbd

    - disk/0: drbd8, size 20.0G

      on primary:   /dev/drbd3 (147:3) in sync, status *DEGRADED*

      on secondary: /dev/drbd3 (147:3) in sync, status *DEGRADED*

Instance name: www-a.dmz.psfc.coop

  Disk template: drbd

    - disk/0: drbd8, size 50.0G

      on primary:   /dev/drbd9 (147:9) in sync, status ok

      on secondary: /dev/drbd9 (147:9) in sync, status ok

Instance name: tools.dmz.psfc.coop

  Disk template: drbd

    - disk/0: drbd8, size 50.0G

      on primary:   /dev/drbd5 (147:5) in sync, status ok

      on secondary: /dev/drbd5 (147:5) in sync, status ok

Instance name: kasha-dev.dmz.psfc.coop

  Disk template: drbd

    - disk/0: drbd8, size 20.0G

      on primary:   /dev/drbd8 (147:8) in sync, status *DEGRADED*

      on secondary: /dev/drbd8 (147:8) in sync, status *DEGRADED*

Instance name: kasha2.dmz.psfc.coop

  Disk template: drbd

    - disk/0: drbd8, size 500.0G

      on primary:   /dev/drbd7 (147:7) in sync, status ok

      on secondary: /dev/drbd7 (147:7) in sync, status ok

Instance name: stg-a.dmz.psfc.coop

  Disk template: drbd

    - disk/0: drbd8, size 50.0G

      on primary:   /dev/drbd2 (147:2) in sync, status *DEGRADED*

      on secondary: /dev/drbd2 (147:2) in sync, status *DEGRADED*

Instance name: stg-b.dmz.psfc.coop

  Disk template: drbd

    - disk/0: drbd8, size 30.0G

      on primary:   /dev/drbd12 (147:12) in sync, status *DEGRADED*

      on secondary: /dev/drbd12 (147:12) in sync, status *DEGRADED*

Instance name: www-b.dmz.psfc.coop

  Disk template: drbd

    - disk/0: drbd8, size 30.0G

      on primary:   /dev/drbd10 (147:10) in sync, status *DEGRADED*

      on secondary: /dev/drbd10 (147:10) in sync, status *DEGRADED*

Instance name: millet.dmz.psfc.coop

  Disk template: drbd

    - disk/0: drbd8, size 10.0G

      on primary:   /dev/drbd6 (147:6) in sync, status *DEGRADED*

      on secondary: /dev/drbd6 (147:6) in sync, status *DEGRADED*

Instance name: www-dev.dmz.psfc.coop

  Disk template: drbd

    - disk/0: drbd8, size 30.0G

      on primary:   /dev/drbd1 (147:1) in sync, status *DEGRADED*

      on secondary: /dev/drbd1 (147:1) in sync, status *DEGRADED* *UNCERTAIN STATE*

Instance name: mem-services-dev.dmz.psfc.coop

  Disk template: drbd

    - disk/0: drbd8, size 30.0G

      on primary:   /dev/drbd0 (147:0) in sync, status *DEGRADED*

      on secondary: /dev/drbd0 (147:0) in sync, status *DEGRADED*

Instance name: git.dmz.psfc.coop

  Disk template: drbd

    - disk/0: drbd8, size 50.0G

      on primary:   /dev/drbd11 (147:11) in sync, status *DEGRADED*

      on secondary: /dev/drbd11 (147:11) in sync, status *DEGRADED* *UNCERTAIN STATE*

Instance name: stats.dmz.psfc.coop

  Disk template: drbd

    - disk/0: drbd8, size 50.0G

      on primary:   /dev/drbd4 (147:4) in sync, status *DEGRADED*

      on secondary: /dev/drbd4 (147:4) in sync, status *DEGRADED*

 

[root@malt-a ~]# cat /proc/drbd

version: 8.3.13 (api:88/proto:86-96)

GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by mock...@builder10.centos.org, 2012-05-07 11:56:36

 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----

    ns:20590348 nr:0 dw:120326528 dr:428048 al:1396 bm:2999 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:443056

 1: cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown C r-----

    ns:24734104 nr:0 dw:40641156 dr:2352476 al:1624 bm:5368 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:128720

 2: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----

    ns:42853656 nr:0 dw:491678348 dr:7931808 al:3712 bm:1937 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:392304

 3: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----

    ns:29931016 nr:0 dw:12403524 dr:21760196 al:265 bm:5546 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:42316

 4: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----

    ns:0 nr:0 dw:647856644 dr:163060488 al:38115 bm:32159 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:160804

 5: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----

    ns:325873172 nr:0 dw:1330523104 dr:27387356 al:309574 bm:1898 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

 6: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown   r-----

    ns:132 nr:0 dw:183343736 dr:387660 al:824 bm:10180 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:16

 7: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----

    ns:1594629336 nr:0 dw:522623992 dr:832856364 al:240839478 bm:125735 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

 8: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----

    ns:68453192 nr:0 dw:357062476 dr:21324512 al:23932 bm:6436 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:310636

 9: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----

    ns:245827428 nr:0 dw:1074208848 dr:49579992 al:84739 bm:12550 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

10: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----

    ns:8593792 nr:0 dw:14491084 dr:946532 al:1067 bm:3571 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:2188612

11: cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown C r-----

    ns:335417740 nr:0 dw:642153224 dr:228816708 al:104007 bm:21925 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:29340156

12: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----

    ns:537112 nr:0 dw:26431452 dr:1612976 al:4585 bm:254 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:6186148
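
The per-state counts quoted at the top of this message (3 Connected, 7 WFConnection, 2 Unconnected, 1 StandAlone) can be reproduced mechanically rather than by eye. The following is a minimal sketch, assuming the /proc/drbd layout shown above; the function name drbd_state_counts is made up for illustration and is not part of Ganeti or DRBD.

```shell
# Tally DRBD connection states (the "cs:" field) from /proc/drbd-style
# input on stdin. Hypothetical helper, not a Ganeti or DRBD tool.
drbd_state_counts() {
    awk '
        # Device lines look like " 3: cs:WFConnection ro:Primary/Unknown ..."
        $1 ~ /^[0-9]+:$/ && $2 ~ /^cs:/ {
            state = $2
            sub(/^cs:/, "", state)
            count[state]++
        }
        END {
            for (state in count) print state, count[state]
        }
    ' | sort
}

# Example (run on a node):
#   drbd_state_counts < /proc/drbd
```

Each output line is a state name followed by the number of minors in that state, sorted by state name.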


[root@malt-b ~]# cat /proc/drbd

version: 8.3.13 (api:88/proto:86-96)

GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by mock...@builder10.centos.org, 2012-05-07 11:56:36

 0: cs:Unconnected ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----

    ns:0 nr:20577512 dw:34971756 dr:0 al:0 bm:2741 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

 1: cs:WFConnection ro:Secondary/Unknown ds:Outdated/DUnknown C r-----

    ns:0 nr:24715024 dw:36188304 dr:0 al:0 bm:5317 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

 2: cs:Unconnected ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----

    ns:0 nr:42852488 dw:91637516 dr:0 al:0 bm:1526 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

 3: cs:Unconnected ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----

    ns:0 nr:29914412 dw:29914412 dr:0 al:0 bm:5452 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

 4: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r-----

    ns:0 nr:0 dw:373000312 dr:0 al:0 bm:31840 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

 5: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----

    ns:0 nr:325867880 dw:465789724 dr:0 al:0 bm:1359 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

 6: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r-----

    ns:0 nr:132 dw:65174068 dr:212 al:113 bm:9964 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:144376

 7: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----

    ns:0 nr:1594590392 dw:729824828 dr:0 al:0 bm:15328 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

 8: cs:Unconnected ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----

    ns:0 nr:68439364 dw:106481568 dr:0 al:0 bm:5409 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

 9: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----

    ns:0 nr:245812436 dw:355606096 dr:0 al:0 bm:11833 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

10: cs:Unconnected ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----

    ns:0 nr:8575508 dw:8575508 dr:0 al:0 bm:3473 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

11: cs:WFConnection ro:Secondary/Unknown ds:Outdated/DUnknown C r-----

    ns:0 nr:335398008 dw:407648056 dr:0 al:0 bm:21650 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

12: cs:Unconnected ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----

    ns:0 nr:536204 dw:536204 dr:0 al:0 bm:134 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
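
With both nodes' views saved to files, the minors where neither side is connected and each side still reports its own disk as UpToDate can be listed automatically. This is only a rough heuristic sketch for spotting candidates worth checking in syslog, not a definitive split-brain test; the function name drbd_both_uptodate and the file names in the example are assumptions for illustration.

```shell
# Usage: drbd_both_uptodate FILE_A FILE_B
# FILE_A and FILE_B are saved copies of /proc/drbd from the two nodes.
# Prints each minor that is not "Connected" on either node while both
# nodes report their local disk state as UpToDate.
drbd_both_uptodate() {
    awk '
        # Device lines: " 3: cs:WFConnection ro:... ds:UpToDate/DUnknown ..."
        $1 ~ /^[0-9]+:$/ && $2 ~ /^cs:/ {
            minor = $1; sub(/:$/, "", minor)
            cs = $2;    sub(/^cs:/, "", cs)
            split($4, parts, "/")          # "ds:<local>/<peer>"
            ds = parts[1]; sub(/^ds:/, "", ds)
            if (FNR == NR) {               # still reading FILE_A
                cs_a[minor] = cs
                ds_a[minor] = ds
            } else if (minor in cs_a) {    # reading FILE_B
                if (cs_a[minor] != "Connected" && cs != "Connected" &&
                    ds_a[minor] == "UpToDate" && ds == "UpToDate")
                    print minor
            }
        }
    ' "$1" "$2"
}

# Example (file names are placeholders):
#   ssh malt-b cat /proc/drbd > /tmp/drbd.malt-b
#   drbd_both_uptodate /proc/drbd /tmp/drbd.malt-b
```

A minor whose secondary already shows Outdated is deliberately not listed, since in that case one side has already given up its claim to current data.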





Viktor Bachraty

Mar 29, 2016, 6:02:51 AM
to gan...@googlegroups.com

Hi John, 

This looks to me like a split-brain (or split-brain-like) situation with DRBD, at least for the first two instances with degraded disks: both disks think they are UpToDate, but they are unconnected. You can verify that in your syslog. What usually helps is replacing the secondary disk:

gnt-instance replace-disks -s <instance>
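
With a dozen affected instances, it may help to generate that command per instance instead of typing each name. The sketch below assumes the "gnt-instance info" output format shown earlier in the thread; degraded_instances is a made-up helper name, and the example loop only echoes the commands so they can be reviewed before anything is run.

```shell
# Read `gnt-instance info` output on stdin and print the name of every
# instance that has at least one disk line marked *DEGRADED*.
# Hypothetical helper, not part of Ganeti.
degraded_instances() {
    awk '
        /^[ \t]*Instance name:/ { name = $3 }
        /\*DEGRADED\*/ && name != "" && !(name in seen) {
            seen[name] = 1
            print name
        }
    '
}

# Example (dry run: prints the commands instead of executing them):
#   gnt-instance info --all | degraded_instances |
#   while read -r inst; do
#       echo gnt-instance replace-disks -s "$inst"
#   done
```

Running the replacements one instance at a time, starting with the least important, keeps the resync load manageable and makes it easier to stop if something looks wrong.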

Alternatively, you can also use drbdadm to disconnect the disks, invalidate the outdated victim and reconnect:

John McNally

Apr 13, 2016, 4:55:34 PM
to gan...@googlegroups.com
Viktor,

Thanks for the response. I am now trying to fix one of the degraded DRBD pairs, as instructed in the "resolve-split-brain" doc. I have selected a less important instance, "foremantest2", and its associated disk, "/dev/drbd3":

[root@malt-a ~]# gnt-instance info foremantest2 | grep -E 'Instance name:|drbd'
  Disk template: drbd
    - disk/0: drbd8, size 20.0G
      on primary:   /dev/drbd3 (147:3) in sync, status *DEGRADED*
      on secondary: /dev/drbd3 (147:3) in sync, status *DEGRADED*

[root@malt-a ~]# cat /proc/drbd | grep -A 1 "3: "
 3: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
    ns:0 nr:0 dw:80716 dr:178688 al:157 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:72568

However, when I run drbdadm, it complains that "no resources are defined":

[root@malt-a ~]# drbdadm secondary 3
no resources defined!

Am I not specifying the resource correctly? How do I determine the resource name that drbdadm expects?

Thanks.

 
_________________
John McNally
(718) 834-0549
jmcn...@acm.org

candlerb

Apr 14, 2016, 8:05:23 AM
to ganeti, jmcn...@acm.org
> However, when I run drbdadm, it complains that "no resources are defined":

That's correct. drbdadm controls resources which are defined statically in /etc/drbd.conf, but Ganeti doesn't use this. Instead it uses drbdsetup to create drbd devices dynamically. drbdadm is not going to help you.

The command "drbd-overview" may show the status for you (if your machine has it), but is unlikely to tell you any more than cat /proc/drbd.
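
Since Ganeti's devices are identified by minor number rather than by a resource name, one way to watch a single device is to filter /proc/drbd by minor. The following is a small sketch with a made-up function name (drbd_minor_status); unlike grep -A 1 "3: ", it matches the minor exactly, so asking for 3 would not also match a minor 13.

```shell
# Usage: drbd_minor_status MINOR < /proc/drbd
# Prints the status lines belonging to exactly one DRBD minor.
# Hypothetical helper, not part of DRBD or Ganeti.
drbd_minor_status() {
    awk -v minor="$1" '
        # A device line starts a new block; remember whether it is ours.
        $1 ~ /^[0-9]+:$/ { show = ($1 == (minor ":")) }
        # Print non-empty lines while inside the requested block.
        show && NF
    '
}

# Example (run on a node):
#   drbd_minor_status 3 < /proc/drbd
```

The minor for a given instance is the number after "/dev/drbd" in the "on primary" and "on secondary" lines of gnt-instance info.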
