DRBD: WFBitMap{S,T} on one instance after cluster reboot

Georg Faerber

Aug 21, 2017, 12:40:04 PM
to gan...@googlegroups.com
Hi all,

Environment
-----------
- Debian jessie
- drbd-utils 8.9.5-1~bpo8+1
- ganeti 2.15.2-1~bpo8+1
- kernel 3.16.43-2+deb8u3
- Small two-node cluster; DRBD uses a dedicated, directly connected
  1Gb interface (no packet loss, no buffer overruns, latency <1 ms;
  the network appears to be fine)
- ~60 instances

Problem
-------
I'm seeing the following: after a cluster reboot (shut down all
instances, reboot the nodes, start all instances), the DRBD resource of
the instance most recently added to the cluster gets stuck (the
following logs are produced by only one node, the Ganeti master):

/proc/drbd
19: cs:WFBitMapT ro:Secondary/Primary ds:Outdated/UpToDate C r---d- ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:21096

gnt-instance info
[...]
on primary: /dev/drbd19 (147:19) *RECOVERING* 0.00%, ETA unknown, status *DEGRADED*
on secondary: /dev/drbd19 (147:19) *RECOVERING* 0.00%, ETA unknown, status *DEGRADED*
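For anyone hitting the same thing, resources stuck in WFBitMapS/WFBitMapT
can be spotted by scanning the connection-state field of /proc/drbd. A
minimal sketch (the sample line and field layout are taken from the
output above; in practice you would read /proc/drbd itself):

```shell
# Flag any DRBD minor whose connection state is stuck in a WFBitMap state.
# Sample line copied from the /proc/drbd output above.
line='19: cs:WFBitMapT ro:Secondary/Primary ds:Outdated/UpToDate C r---d- ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:21096'
printf '%s\n' "$line" |
  awk '$2 ~ /^cs:WFBitMap/ { minor = $1; sub(":", "", minor); print "minor " minor " stuck in " substr($2, 4) }'
```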

syslog
[...]
Aug 18 22:03:46 gnt1 kernel: [ 1028.781993] block drbd19: peer( Primary -> Secondary )
Aug 18 22:03:46 gnt1 kernel: [ 1028.782021] drbd resource19: peer( Secondary -> Unknown ) conn( WFBitMapT -> TearDown ) pdsk( UpToDate -> DUnknown )
Aug 18 22:03:46 gnt1 kernel: [ 1028.782047] drbd resource19: asender terminated
Aug 18 22:03:46 gnt1 kernel: [ 1028.782050] drbd resource19: Terminating drbd_a_resource
Aug 18 22:03:46 gnt1 kernel: [ 1028.793353] drbd resource19: Connection closed
Aug 18 22:03:46 gnt1 kernel: [ 1028.793360] drbd resource19: conn( TearDown -> Unconnected )
Aug 18 22:03:46 gnt1 kernel: [ 1028.793361] drbd resource19: receiver terminated
Aug 18 22:03:46 gnt1 kernel: [ 1028.793361] drbd resource19: Restarting receiver thread
Aug 18 22:03:46 gnt1 kernel: [ 1028.793362] drbd resource19: receiver (re)started
Aug 18 22:03:46 gnt1 kernel: [ 1028.793365] drbd resource19: conn( Unconnected -> WFConnection )
Aug 18 22:03:46 gnt1 kernel: [ 1029.102469] drbd resource19: conn( WFConnection -> Disconnecting )
Aug 18 22:03:46 gnt1 kernel: [ 1029.102492] drbd resource19: Discarding network configuration.
Aug 18 22:03:46 gnt1 kernel: [ 1029.102513] drbd resource19: Connection closed
Aug 18 22:03:46 gnt1 kernel: [ 1029.102518] drbd resource19: conn( Disconnecting -> StandAlone )
Aug 18 22:03:46 gnt1 kernel: [ 1029.102519] drbd resource19: receiver terminated
Aug 18 22:03:46 gnt1 kernel: [ 1029.102520] drbd resource19: Terminating drbd_r_resource
Aug 18 22:03:46 gnt1 kernel: [ 1029.102526] block drbd19: disk( Outdated -> Failed )
Aug 18 22:03:46 gnt1 kernel: [ 1029.149565] block drbd19: bitmap WRITE of 0 pages took 0 jiffies
Aug 18 22:03:46 gnt1 kernel: [ 1029.149652] block drbd19: 25 MB (6492 bits) marked out-of-sync by on disk bit-map.
Aug 18 22:03:46 gnt1 kernel: [ 1029.149655] block drbd19: disk( Failed -> Diskless )
Aug 18 22:03:46 gnt1 kernel: [ 1029.149789] block drbd19: drbd_bm_resize called with capacity == 0
Aug 18 22:03:46 gnt1 kernel: [ 1029.149798] drbd resource19: Terminating drbd_w_resource
Aug 18 22:03:48 gnt1 kernel: [ 1031.049038] drbd resource19: Starting worker thread (from drbdsetup-84 [11191])
Aug 18 22:03:48 gnt1 kernel: [ 1031.049163] block drbd19: disk( Diskless -> Attaching )
Aug 18 22:03:48 gnt1 kernel: [ 1031.049230] drbd resource19: Method to ensure write ordering: flush
Aug 18 22:03:48 gnt1 kernel: [ 1031.049232] block drbd19: max BIO size = 1048576
Aug 18 22:03:48 gnt1 kernel: [ 1031.049236] block drbd19: drbd_bm_resize called with capacity == 20987904
Aug 18 22:03:48 gnt1 kernel: [ 1031.049275] block drbd19: resync bitmap: bits=2623488 words=40992 pages=81
Aug 18 22:03:48 gnt1 kernel: [ 1031.049277] block drbd19: size = 10 GB (10493952 KB)
Aug 18 22:03:48 gnt1 kernel: [ 1031.050189] block drbd19: bitmap READ of 81 pages took 0 jiffies
Aug 18 22:03:48 gnt1 kernel: [ 1031.050224] block drbd19: recounting of set bits took additional 0 jiffies
Aug 18 22:03:48 gnt1 kernel: [ 1031.050226] block drbd19: 25 MB (6492 bits) marked out-of-sync by on disk bit-map.
Aug 18 22:03:48 gnt1 kernel: [ 1031.050228] block drbd19: disk( Attaching -> Outdated )
Aug 18 22:03:48 gnt1 kernel: [ 1031.050230] block drbd19: attached to UUIDs CAEADE6E0553E994:0000000000000000:830E7D632655AEA0:830D7D632655AEA0
Aug 18 22:03:48 gnt1 kernel: [ 1031.054932] drbd resource19: Method to ensure write ordering: flush
Aug 18 22:03:48 gnt1 kernel: [ 1031.078151] drbd resource19: conn( StandAlone -> Unconnected )
Aug 18 22:03:48 gnt1 kernel: [ 1031.078163] drbd resource19: Starting receiver thread (from drbd_w_resource [11194])
Aug 18 22:03:48 gnt1 kernel: [ 1031.078218] drbd resource19: receiver (re)started
Aug 18 22:03:48 gnt1 kernel: [ 1031.078225] drbd resource19: conn( Unconnected -> WFConnection )
Aug 18 22:03:48 gnt1 kernel: [ 1031.097638] drbd resource19: Method to ensure write ordering: flush
Aug 18 22:03:49 gnt1 kernel: [ 1031.576536] drbd resource19: Handshake successful: Agreed network protocol version 101
Aug 18 22:03:49 gnt1 kernel: [ 1031.576538] drbd resource19: Agreed to support TRIM on protocol level
Aug 18 22:03:49 gnt1 kernel: [ 1031.576642] drbd resource19: Peer authenticated using 16 bytes HMAC
Aug 18 22:03:49 gnt1 kernel: [ 1031.576657] drbd resource19: conn( WFConnection -> WFReportParams )
Aug 18 22:03:49 gnt1 kernel: [ 1031.576658] drbd resource19: Starting asender thread (from drbd_r_resource [11208])
Aug 18 22:03:49 gnt1 kernel: [ 1031.623267] block drbd19: drbd_sync_handshake:
Aug 18 22:03:49 gnt1 kernel: [ 1031.623275] block drbd19: self CAEADE6E0553E994:0000000000000000:830E7D632655AEA0:830D7D632655AEA0 bits:6492 flags:0
Aug 18 22:03:49 gnt1 kernel: [ 1031.623279] block drbd19: peer E9699ADC03C62D17:CAEADE6E0553E994:830E7D632655AEA0:830D7D632655AEA0 bits:6492 flags:0
Aug 18 22:03:49 gnt1 kernel: [ 1031.623282] block drbd19: uuid_compare()=-1 by rule 50
Aug 18 22:03:49 gnt1 kernel: [ 1031.623289] block drbd19: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Aug 18 22:03:49 gnt1 kernel: [ 1031.645115] block drbd19: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 339(1), total 339; compression: 99.9%
Aug 18 22:03:49 gnt1 kernel: [ 1031.645173] block drbd19: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 339(1), total 339; compression: 99.9%
Aug 18 22:03:49 gnt1 kernel: [ 1031.645175] block drbd19: conn( WFBitMapT -> WFSyncUUID )
Aug 18 22:03:49 gnt1 kernel: [ 1031.656092] block drbd19: updated sync uuid CAEBDE6E0553E994:0000000000000000:830E7D632655AEA0:830D7D632655AEA0
Aug 18 22:03:49 gnt1 kernel: [ 1031.658984] block drbd19: helper command: /bin/true before-resync-target minor-19
Aug 18 22:03:49 gnt1 kernel: [ 1031.659348] block drbd19: helper command: /bin/true before-resync-target minor-19 exit code 0 (0x0)
Aug 18 22:03:49 gnt1 kernel: [ 1031.659354] block drbd19: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
Aug 18 22:03:49 gnt1 kernel: [ 1031.659360] block drbd19: Began resync as SyncTarget (will sync 25968 KB [6492 bits set]).
Aug 18 22:03:49 gnt1 kernel: [ 1032.028148] block drbd19: Resync done (total 1 sec; paused 0 sec; 25968 K/sec)
Aug 18 22:03:49 gnt1 kernel: [ 1032.028152] block drbd19: updated UUIDs E9699ADC03C62D16:0000000000000000:CAEBDE6E0553E994:CAEADE6E0553E994
Aug 18 22:03:49 gnt1 kernel: [ 1032.028156] block drbd19: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Aug 18 22:03:49 gnt1 kernel: [ 1032.033246] block drbd19: helper command: /bin/true after-resync-target minor-19
Aug 18 22:03:49 gnt1 kernel: [ 1032.033494] block drbd19: helper command: /bin/true after-resync-target minor-19 exit code 0 (0x0)
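As a side note, the numbers in that log are internally consistent: DRBD
8.x tracks dirty data at a granularity of one bitmap bit per 4 KiB, so
the 6492 set bits line up with both the "will sync 25968 KB" and the
"25 MB ... out-of-sync" messages:

```shell
# DRBD 8.x bitmap granularity: one bit per 4 KiB of backing storage.
bits=6492
kib=$((bits * 4))
echo "${kib} KiB out of sync"    # 25968 KiB, matching "will sync 25968 KB"
echo "~$((kib / 1024)) MiB"      # ~25 MiB, matching "25 MB ... out-of-sync"
```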

This is reproducible: it happens after each reboot. Adding a new
instance "moves" the problem to the new one, while the old one is fine
again (and "fixed").

Workaround
----------
Stopping the instance and starting it again fixes this, but that
doesn't feel right. Could anybody shed some light on why this is
happening?
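For reference, the workaround boils down to a clean stop/start cycle of
the affected instance, which makes Ganeti tear down and re-create the
DRBD device pair. A dry-run sketch (the instance name is hypothetical):

```shell
# Hypothetical instance name; replace with the affected instance.
instance="instance19.example.com"
run() { echo "+ $*"; }   # dry run; change body to "$@" to actually execute
run gnt-instance shutdown "$instance"
run gnt-instance startup "$instance"
```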

Thanks,
Georg

Alexandros Afentoulis

Aug 22, 2017, 5:29:25 AM
to gan...@googlegroups.com, Georg Faerber
On 08/21/2017 07:38 PM, Georg Faerber wrote:

[snip]

Hello there,

This kernel patch could be related:
https://patchwork.kernel.org/patch/5189451/
It is only included in kernel 3.19 and later, though.

Applying this patch on top of Debian's linux 3.16.43 and going through
some reboots suggested that it might actually solve the problem.

Unfortunately, living with a patched/diverged kernel is not really
viable. If someone else tests and verifies this too, it might be worth
reporting it to Debian.

Greetings

Georg Faerber

Aug 22, 2017, 6:21:04 AM
to gan...@googlegroups.com
Just so I understand you correctly: did you encounter this problem as
well, and did the patch (thanks for the link!) fix it for you?

Cheers,
Georg

P.S.: No need to Cc me, I'm subscribed to the list.

Alexandros Afentoulis

Aug 24, 2017, 6:06:25 AM
to gan...@googlegroups.com
On 08/22/2017 01:20 PM, Georg Faerber wrote:
>
> Just so I understand you correctly: did you encounter this problem as
> well, and did the patch (thanks for the link!) fix it for you?
>
> Cheers,
> Georg
>

Hello there,

Yes, that's what I'm saying.

The thing is, we did not test this thoroughly, so please do your own
research and testing too. That's also why I didn't go as far as
reporting it to Debian.

Greetings

Georg Faerber

Aug 24, 2017, 9:32:02 AM
to gan...@googlegroups.com
Hi,

On 17-08-24 13:05:49, Alexandros Afentoulis wrote:
> Yes, that's what I'm saying.

Alright, thanks.

> The thing is, we did not test this thoroughly, so please do your own
> research and testing too. That's also why I didn't go as far as
> reporting it to Debian.

I would like to continue testing, but the patch doesn't apply cleanly.
Did you sort that out?

Cheers,
Georg

Alexandros Afentoulis

Aug 25, 2017, 3:26:33 AM
to gan...@googlegroups.com
On 08/24/2017 04:31 PM, Georg Faerber wrote:
>
> I would like to continue testing, but the patch doesn't apply cleanly.
> Did you sort that out?
>
> Cheers,
> Georg
>

Hey Georg,

indeed, the upstream patch does not apply as-is.

This is the patch rebased against Debian's linux 3.16.43:
https://pithos.okeanos.grnet.gr/public/yimx66W0SVPx6Hk52wMV14

Hope it works for you.
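For anyone else testing this: before committing to a kernel rebuild,
it's cheap to check whether a patch would apply cleanly with
patch --dry-run. A self-contained toy demonstration (a one-line file
and diff standing in for the kernel tree and the DRBD patch):

```shell
set -e
work=$(mktemp -d) && cd "$work"
# Toy "source tree": one file, plus a unified diff against it.
printf 'old line\n' > file.c
cat > fix.patch <<'EOF'
--- a/file.c
+++ b/file.c
@@ -1 +1 @@
-old line
+new line
EOF
patch -p1 --dry-run < fix.patch   # exit status 0 means the patch would apply cleanly
patch -p1 < fix.patch             # apply for real
cat file.c                        # the file now contains the new line
```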

Greetings