DRBD is loaded with minor_count=128.
I gave 4GB to dom0 because it initially had only 512MB and I ran into
out-of-memory errors when trying to add DRBD instances. This was in
production and I did not have time to test, so I erred on the side of
too much memory rather than too little.
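For completeness, the minor_count setting above can be made persistent with a modprobe configuration fragment. This is a minimal sketch; the file name and path are assumptions based on a typical modprobe.d layout:

```shell
# /etc/modprobe.d/drbd.conf -- file name is an assumption; any .conf works
# Raise the DRBD minor limit from the default of 32 to 128.
options drbd minor_count=128

# After (re)loading the module, the live value can be verified with:
#   cat /sys/module/drbd/parameters/minor_count
```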
Source server dmesg log:
block drbd1: Starting worker thread (from cqueue/1 [258])
block drbd1: disk( Diskless -> Attaching )
block drbd1: No usable activity log found.
block drbd1: Method to ensure write ordering: barrier
block drbd1: max_segment_size ( = BIO size ) = 32768
block drbd1: drbd_bm_resize called with capacity == 419430400
block drbd1: resync bitmap: bits=52428800 words=819200
block drbd1: size = 200 GB (209715200 KB)
block drbd1: Writing the whole bitmap, size changed
block drbd1: 200 GB (52428800 bits) marked out-of-sync by on disk bit-map.
block drbd1: recounting of set bits took additional 2 jiffies
block drbd1: 200 GB (52428800 bits) marked out-of-sync by on disk bit-map.
block drbd1: disk( Attaching -> Inconsistent )
block drbd1: Barriers not supported on meta data device - disabling
block drbd1: conn( StandAlone -> Unconnected )
block drbd1: Starting receiver thread (from drbd1_worker [19944])
block drbd1: receiver (re)started
block drbd1: conn( Unconnected -> WFConnection )
block drbd1: Handshake successful: Agreed network protocol version 94
block drbd1: Peer authenticated using 16 bytes of 'md5' HMAC
block drbd1: conn( WFConnection -> WFReportParams )
block drbd1: Starting asender thread (from drbd1_receiver [19949])
block drbd1: data-integrity-alg: <not-used>
block drbd1: drbd_sync_handshake:
block drbd1: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:52428800 flags:0
block drbd1: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:52428800 flags:0
block drbd1: uuid_compare()=0 by rule 10
block drbd1: No resync, but 52428800 bits in bitmap!
block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent )
block drbd1: role( Secondary -> Primary ) disk( Inconsistent -> UpToDate )
block drbd1: Forced to consider local data as UpToDate!
block drbd1: Creating new current UUID
block drbd1: drbd_sync_handshake:
block drbd1: self 8B26969D7F9BD80D:0000000000000004:0000000000000000:0000000000000000 bits:52428800 flags:0
block drbd1: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:52428800 flags:0
block drbd1: uuid_compare()=2 by rule 30
block drbd1: Becoming sync source due to disk states.
block drbd1: Writing the whole bitmap, full sync required after drbd_sync_handshake.
block drbd1: 200 GB (52428800 bits) marked out-of-sync by on disk bit-map.
block drbd1: conn( Connected -> WFBitMapS )
block drbd1: conn( WFBitMapS -> SyncSource )
block drbd1: Began resync as SyncSource (will sync 209715200 KB [52428800 bits set]).
block drbd1: peer( Secondary -> Unknown ) conn( SyncSource -> TearDown )
block drbd1: asender terminated
block drbd1: Terminating asender thread
block drbd1: Connection closed
block drbd1: conn( TearDown -> Unconnected )
block drbd1: receiver terminated
block drbd1: Restarting receiver thread
block drbd1: receiver (re)started
block drbd1: conn( Unconnected -> WFConnection )
block drbd1: role( Primary -> Secondary )
block drbd1: conn( WFConnection -> Disconnecting )
block drbd1: Discarding network configuration.
block drbd1: Connection closed
block drbd1: conn( Disconnecting -> StandAlone )
block drbd1: receiver terminated
block drbd1: Terminating receiver thread
block drbd1: disk( UpToDate -> Diskless ) pdsk( Inconsistent -> DUnknown )
block drbd1: drbd_bm_resize called with capacity == 0
block drbd1: worker terminated
block drbd1: Terminating worker thread
In the logs everything looks fine, but the sync simply stalls and,
given time (5-10 minutes), never recovers. There are no dropped packets
or any other indications of errors in the ring buffer.
Here is another bit of information: I can add more than 17 DRBD
instances to nodes when the instances are not doing anything. I added
20 PVM instances that just hung in kickstart (meaning the instance was
started but was waiting for input) without any issue on 2 nodes (50/50
primary on each node) that were recently added/updated.
I am stumped.
On Apr 13, 11:59 am, Simon Deziel <simon.dez...@gmail.com> wrote:
> On 12-04-13 11:58 AM, ampictage wrote:
>
> > Wondering if anyone else has had this problem:
>
> > I have a cluster of 5 ganeti hosts (xen with dom0_mem=4096M), but I
> > cannot add more than 16 drbd disks per host; once the 17th instance is
> > created, it stalls on sync with short read errors and never finishes.
>
> By default, the drbd module only allows creating 32 devices, which you
> might have reached with 16 instances (each instance takes 2 DRBD
> devices: data & metadata). Did you increase the minor_count when
> loading the drbd module? See
> http://docs.ganeti.org/ganeti/master/html/install.html#installing-drbd
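Simon's arithmetic can be sketched as a quick sanity check (the figure of 2 DRBD devices per instance is his, not something I have verified independently):

```python
# Why the 17th instance would be the first to fail under the default limit,
# assuming (per Simon's reply) 2 DRBD devices per instance.
DEFAULT_MINOR_COUNT = 32   # drbd module default for minor_count
DEVICES_PER_INSTANCE = 2   # data + metadata, per Simon's reply

max_instances = DEFAULT_MINOR_COUNT // DEVICES_PER_INSTANCE
print(max_instances)  # prints 16: the 17th instance exceeds the default
```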