Accounting for drbd traffic


candlerb

Apr 20, 2018, 3:54:53 AM
to ganeti
I have a 2-node ganeti cluster with spinning rust hard drives. Each node has a pair of drives in mdadm RAID1, and all the VMs are also drbd-replicated between the machines.

The disks are now running close to saturation - more than 60% utilisation - and things grind to a crawl when the weekly RAID scan takes place.  I would like to find out if one of the VMs is causing a majority of disk I/O and if so reduce it.

This turns out to be easier said than done.

1. Accounting for disk writes from an individual process is not straightforward.  (It might be easier if ganeti put each VM inside its own cgroup, which I believe libvirt does)

- I have netdata running, and have given it the UUID of each VM in its apps_groups.conf.  But the volume (MB) of writes by a kvm process doesn't translate easily into disk ops; it depends very much on the size of each write, which VMs are doing fsyncs, which kvm cache mode is being used etc.

- "iotop" gives an instantaneous snapshot of I/O activity per process, but has the same problem as netdata: it's giving MB/sec not I/O operations per second.

- "iostat" gives an instantaneous snapshot of disk ops per /dev/drbd device, which I can tediously map back to VM, but is hard to interpret when there is a lot of bursty activity

2. Even if I could account for disk usage from individual processes, it's much harder to account for I/O ops generated by DRBD itself.  If VM A on node 1 performs X amount of IOPS on /dev/drbdA, there will also be some amount of IOPS generated on node 2 on the corresponding /dev/drbdA (zero for reads but non-zero for writes).  It's unclear if there's a 1:1 relationship, or whether writes are aggregated or amplified across DRBD.
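One low-tech way to put numbers on this (a sketch, not from the original post): sample the kernel's per-device counters in /sys/block/&lt;dev&gt;/stat on both nodes and compare. The field layout follows the kernel's Documentation/block/stat; the helper names here are made up.

```python
import time

# Field layout per the kernel's Documentation/block/stat; newer kernels
# append discard/flush counters, which zip() below simply ignores.
STAT_FIELDS = ("read_ios", "read_merges", "read_sectors", "read_ticks",
               "write_ios", "write_merges", "write_sectors", "write_ticks",
               "in_flight", "io_ticks", "time_in_queue")

def parse_blockdev_stat(line):
    """Parse one /sys/block/<dev>/stat line into a dict of counters."""
    return dict(zip(STAT_FIELDS, (int(v) for v in line.split())))

def write_iops(dev, interval=10):
    """Average write ops/sec on e.g. dev='drbd16' over `interval` seconds."""
    path = "/sys/block/%s/stat" % dev
    with open(path) as f:
        before = parse_blockdev_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_blockdev_stat(f.read())
    return (after["write_ios"] - before["write_ios"]) / float(interval)
```

Running this for the same drbd minor on both nodes during steady load would show directly whether the secondary sees a 1:1 write stream.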

Does anybody have any hints as to pinning down the source of the IOPS?  (Apart from migrating to SSD where it wouldn't matter so much :-)

Thanks,

Brian.

P.S. There is one other interesting performance issue.  Each node has a mixture of drives: one Seagate ST1000NM0011 and one Hitachi HUA721010KLA330.  This seemed like a good idea at the time: make them less prone to simultaneous drive failure.  Both drives are 1TB / 7200rpm, both drives have 512 byte sectors, both are claimed to be "enterprise" by the manufacturer.

However during the weekly mdadm RAID scrub, the utilisation of the Hitachi flatlines at 100% whilst the Seagate is around 65%.

Any possible clues here? I suppose the drives could have different physical geometries, although they return exactly the same number of blocks.

root@wrn-vm1:~# smartctl -a /dev/sda | egrep '(Model|Rotation)'

Model Family:     Seagate Constellation ES (SATA 6Gb/s)
Device Model:     ST1000NM0011
Rotation Rate:    7200 rpm
root@wrn-vm1:~# smartctl -a /dev/sdb | egrep '(Model|Rotation)'
Model Family:     Hitachi Ultrastar A7K1000
Device Model:     Hitachi HUA721010KLA330

root@wrn-vm1:~# cat /sys/block/sda/queue/physical_block_size
512
root@wrn-vm1:~# cat /sys/block/sdb/queue/physical_block_size
512

root@wrn-vm1:~# blockdev --getsize64 /dev/sda
1000204886016
root@wrn-vm1:~# blockdev --getsize64 /dev/sdb
1000204886016

root@wrn-vm1:~# hdparm /dev/sda

/dev/sda:
 multcount     =  0 (off)
 IO_support    =  1 (32-bit)
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 121601/255/63, sectors = 1953525168, start = 0
root@wrn-vm1:~# hdparm /dev/sdb

/dev/sdb:
 multcount     =  0 (off)
 IO_support    =  1 (32-bit)
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 121601/255/63, sectors = 1953525168, start = 0

root@wrn-vm1:~# hdparm -a /dev/sda

/dev/sda:
 readahead     = 256 (on)
root@wrn-vm1:~# hdparm -a /dev/sdb

/dev/sdb:
 readahead     = 256 (on)

root@wrn-vm1:~# hdparm -A /dev/sda

/dev/sda:
 look-ahead    =  1 (on)
root@wrn-vm1:~# hdparm -A /dev/sdb

/dev/sdb:
 look-ahead    =  1 (on)

candlerb

Apr 20, 2018, 8:40:10 AM
to ganeti
Digging down further: I have found through netdata that there are two DRBD devices (drbd16 on node1, drbd17 on node2) which are each generating about 100 write ops per second, although only writing about 0.5MB/sec.  This implies lots of small transaction flushes.

The association between instance and drbd can be found as symlinks in this directory:

root@wrn-vm1:~# ls -l /var/run/ganeti/instance-disks/
total 0
lrwxrwxrwx 1 root root 11 Feb  1 22:22 alvm1:0 -> /dev/drbd14
lrwxrwxrwx 1 root root 11 Feb 25 11:42 stackstorm.int.example.com:0 -> /dev/drbd21
lrwxrwxrwx 1 root root 11 Feb 13 15:10 temp-rec:0 -> /dev/drbd20
lrwxrwxrwx 1 root root 11 Feb  1 22:20 wrn-dns1.int.example.com:0 -> /dev/drbd16
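That mapping step can be scripted; a minimal sketch (the function name is mine, not a ganeti API) that resolves those symlinks into a dict:

```python
import os

def instance_disk_map(dirpath="/var/run/ganeti/instance-disks"):
    """Return e.g. {'wrn-dns1.int.example.com:0': '/dev/drbd16'}."""
    mapping = {}
    for name in sorted(os.listdir(dirpath)):
        target = os.readlink(os.path.join(dirpath, name))
        # Symlink targets may be relative (e.g. ../dm-29); normalise them.
        mapping[name] = os.path.normpath(os.path.join(dirpath, target))
    return mapping
```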

At the drbd level, the stats are only seen on the master side.  To see the generated IOPS on the slave side I need to look at the LVM partition devices.  I found this by looking at "gnt-instance info", looking for "child devices" of the drbd disk, e.g. /dev/xenvg/d015d3d9-f62e-4a5e-b9ce-e1f635bb36f5.disk0_data, then again there's a symlink:

root@wrn-vm2:~# ls -l /dev/xenvg/d015d3d9-f62e-4a5e-b9ce-e1f635bb36f5.disk0_data
lrwxrwxrwx 1 root root 8 Jan 25 19:31 /dev/xenvg/d015d3d9-f62e-4a5e-b9ce-e1f635bb36f5.disk0_data -> ../dm-29

and hey presto, I can see the IOPS on that device on the slave too.  So 100 write IOPS locally on node 2 plus 100 IOPS for replication from node1 is a lot of IOPS for a spinning disk.

Now here's the painful part... those VMs are actually lxd container hosts with btrfs, running services including three other containers.  So now I have to dig inside that VM to find out which process is generating all those flushes :-(

This means we've moved outside the sphere of ganeti, in which case I'm happy to park the question - unless it turns out that something is screwed up with write caching and for some reason every 4KB block write is turning into a flush.  But I don't think that should be the case - gnt-instance info says:

    disk_aio: default (threads)
    disk_cache: default (default)
    disk_type: default (paravirtual)

And according to the manpage of qemu-system-x86_64, the default is cache=writeback

Regards,

Brian.

Sascha Lucas

Apr 20, 2018, 9:44:58 AM
to gan...@googlegroups.com
Hello Brian,

just a few comments:

Have you considered the qemu block stats:

echo "info blockstats" | socat STDIO UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/your.vm.monitor
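The monitor reply is line-oriented key=value text; a hedged sketch of a parser (the exact field set varies by qemu version, and the sample format in the test is an assumption, not output from this cluster):

```python
# Sketch: parse an HMP "info blockstats" reply into per-drive counters.
# It just splits key=value pairs, so it should tolerate extra fields
# and skip monitor prompts or banner lines.

def parse_blockstats(text):
    """Map drive name -> {counter: int} from 'info blockstats' output."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line or "=" not in line:
            continue  # not a stats line (e.g. a "(qemu)" prompt)
        drive, rest = line.split(":", 1)
        counters = {}
        for pair in rest.split():
            key, _, value = pair.partition("=")
            if value.lstrip("-").isdigit():
                counters[key] = int(value)
        if counters:
            stats[drive.strip()] = counters
    return stats
```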

iostat would be my choice. I agree mapping DRBDs/LVs to VMs is tedious,
but in the end one can get a graph with the I/Os you want.

On Fri, 20 Apr 2018 09:54:53 +0200 candlerb wrote:

> non-zero for writes).  It's unclear if there's a 1:1 relationship, or whether
> writes are aggregated or amplified across DRBD.

AFAIK the I/Os on DRBD are 1:1 on primary and secondary node (respectively on
the data LVs). However DRBD produces some extra I/Os on the meta LVs when
doing writes (I think on primary node only).

> However during the weekly mdadm RAID scrub, the utilisation of the Hitachi
> flatlines at 100% whilst the Seagate is around 65%.

I have seen that a long time ago, but have no explanation. However, there
might be more things to consider: recoverable read errors produce performance
degradation. And I wouldn't expect different disks/vendors to produce the
same performance.

Thanks, Sascha.

candlerb

Apr 20, 2018, 10:31:43 AM
to ganeti
> echo "info blockstats" | socat STDIO UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/your.vm.monitor 

That's neat, thank you.  That could probably be turned into a prometheus exporter or somesuch.
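For example (a sketch only, with invented metric names), per-drive counters taken from "info blockstats" could be rendered in Prometheus exposition format for node_exporter's textfile collector:

```python
# Sketch: render {drive: {counter: value}} as Prometheus exposition-format
# lines. Metric and label names here are made up for illustration.

def to_prometheus(stats, instance):
    """Render per-drive counter dicts as exposition-format lines."""
    lines = []
    for drive, counters in sorted(stats.items()):
        for key, value in sorted(counters.items()):
            lines.append('qemu_blockstats_%s{instance="%s",drive="%s"} %d'
                         % (key, instance, drive, value))
    return "\n".join(lines) + "\n"
```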

Benjamin Redling

Apr 20, 2018, 11:32:44 AM
to gan...@googlegroups.com
Am 20.04.2018 um 15:44 schrieb Sascha Lucas:
>> However during the weekly mdadm RAID scrub, the utilisation of the
>> Hitachi flatlines at 100% whilst the Seagate is around 65%.

> I have seen that a long time ago, but have no explanation. However, there
> might be more things to consider: recoverable read errors produce
> performance degradation. And I wouldn't expect different disks/vendors to
> produce the same performance.

I was curious and looked it up:
the Seagate Constellation model has a 64MB cache; the HGST Ultrastar
model in use here has only 32MB, and has 4KB sectors while emulating 512-byte sectors

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323

candlerb

Apr 20, 2018, 12:22:40 PM
to ganeti
OK, that bit about emulating 512 byte sectors might be relevant (where did you find info about that model having 4KB physical sectors?)

I am looking at two graphs now, running netdata both inside the VM and on the ganeti host.

* Inside the VM, I am looking at the virtual disk (vda)



* On the host, I am looking at the LVM volume which maps to this virtual disk



As you would expect, they both track each other very closely.  But the host sees four times as many disk IOPS.  Looking at the point in time I've highlighted (the red dot):

* Guest: 1.00 MB/sec, 79 writes/sec (implies average 13KB per write)
* Host: 1.04 MB/sec, 294 writes/sec (implies average 3.6KB per write)
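As a quick sanity check, the implied average write sizes fall straight out of those figures:

```python
# Back-of-envelope check of the implied average write sizes quoted above
# (MB/s divided by writes/s, taking 1 MB = 1024 KB).
guest_avg_kb = 1.00 * 1024 / 79    # guest: ~13 KB per write
host_avg_kb = 1.04 * 1024 / 294    # host:  ~3.6 KB per write

print(round(guest_avg_kb, 1), round(host_avg_kb, 1))  # -> 13.0 3.6
```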

That doesn't seem correct.  Identical figures are seen if I look at the corresponding drbd device (1.04MB/sec, 294 writes/sec)



Now, this LVM volume sits on a physical volume which is a RAID1 of two disks. There are other VMs running, and also this host is the target of DRBD replication from the other host, so that's mixed in here.  But looking at the physical disks anyway:



These I can believe.  At the same point in time, they show 1.6MB/sec (179 ops/sec) and 1.64MB/sec (201 ops/sec), which could well be 79 ops/sec from this host plus ~100 ops/sec from other VMs, including replication from the other host.  And as you can see, this makes the disks rather busy:


(especially the lower purple graph, % utilisation)

However I'm at a loss to explain the 4 times discrepancy between host LVM/drbd stats and guest stats. Maybe I need to dig down deeper into LVM block accounting.

Regards, Brian.

sascha...@web.de

Apr 21, 2018, 12:32:09 PM
to gan...@googlegroups.com
Hi Brian,

On Fri, 20 Apr 2018 18:22:40 +0200 candlerb wrote:

> As you would expect, they both track each other very closely

I wouldn't expect this with cache=default (writeback). I think today qemu
correctly handles disk flushes from the guest to the host. So I assume your
guest flushes often?

I think writeback cache is unsafe for live migration. I use cache=none
(direct I/O), bypassing host page cache for various good reasons :-).

> However I'm at a loss to explain the 4 times discrepancy between host
> LVM/drbd stats and guest stats. Maybe I need to dig down deeper into LVM
> block accounting.

When looking at I/Os on LVs/PVs it is common to see more I/Os on the LV than
on its PV, because of I/O merging (wrqm in iostat) at the physical level,
especially when doing buffered writes (writeback cache).

Thanks, Sascha.

candlerb

Apr 21, 2018, 5:09:45 PM
to ganeti
On Saturday, 21 April 2018 17:32:09 UTC+1, sascha...@web.de wrote:
>> As you would expect, they both track each other very closely
>
> I wouldn't expect this with cache=default (writeback). I think today qemu
> correctly handles disk flushes from the guest to the host. So I assume your
> guest flushes often?


I'm not sure. I installed auditd in the guest and told it to look for sync events:

auditctl -S sync -S fsync -S fdatasync -a exit,always

and most of the time it didn't show anything.  Occasionally there was a short burst of them (from a samba AD domain controller) - a peak of 7 per second over 2 or 3 seconds.

 
> When looking at I/Os on LVs/PVs it is common to see more I/Os on the LV
> than on its PV, because of I/O merging (wrqm in iostat) at the physical
> level.

That makes sense.  But I don't understand why the number of actual LV writes is higher than the writes in the guest virtual disk which maps to that LV.

candlerb

Apr 22, 2018, 6:13:36 AM
to ganeti
Ah, I see now: writes aren't aggregated at the LVM layer, making the number of LV write operations/sec higher than on the underlying disks.

I can demonstrate this without KVM at all: I first created a LV by creating a stopped ganeti instance, then ran this test program directly on the host:

import os
import sys
import time

interval = 0.1
blks = [os.urandom(1048576) for i in range(2)]

fd = os.open("/var/run/ganeti/instance-disks/tstplain:0", os.O_WRONLY)

t1 = time.time()
while True:
  for blk in blks:
    print(".", file=sys.stderr)
    os.lseek(fd, 0, os.SEEK_SET)
    os.write(fd, blk)
    t1 += interval
    time.sleep(t1 - time.time())

10 times per second, this program writes 1MB of data to the given LVM volume, but without a sync so it just makes dirty pages.

What I see on the host is that as expected, every 20-30 seconds there is a flush of the page cache to disk, which writes 1MB of data, but this is counted as 256 writes.  (In Netdata this is split over two adjacent seconds: e.g. writes of 0.27MB/sec and 0.73MB/sec; and 69.6 write ops/sec and 186.4 write ops/sec in the same two seconds).
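That 256-write figure is consistent with the page cache submitting one write per dirty 4KB page, unmerged at the LV layer:

```python
# 1MB flushed as 256 write operations is exactly one op per 4KB dirty page.
ops = 1048576 // 4096
print(ops)  # -> 256
```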


To be able to see the behaviour on the physical underlying disks I had to shut down all ganeti instances first, and it's clear they are aggregated: I see 1MB of writes aggregated into 2 write operations.


So that is at least one issue removed.

Incidentally, there's some other weird stuff with LVM.  Every time you open an LVM volume for write, it generates a large amount of reads.  For example, run this program:

import os
import sys
import time

t1 = time.time()
interval = 0.1

while True:
  print(".", file=sys.stderr)
  fd = os.open("/var/run/ganeti/instance-disks/tstplain:0", os.O_WRONLY)
  #os.lseek(fd, 0, os.SEEK_SET)
  #os.write(fd, os.urandom(256))
  os.close(fd)
  t1 += interval
  time.sleep(t1 - time.time())

It just opens and closes an LVM volume (for write) 10 times per second, with no actual writes.  But it generates 20MB/sec of reads and 860 read operations/sec on that volume.




candlerb

Apr 22, 2018, 6:20:49 AM
to ganeti
Oh, and I think I worked out where the flushes are coming from, although I have not proved this independently yet.

The VM which contains the containers, and is generating a large amount of disk I/O, is using btrfs. And looking at the mount options for btrfs:

barrier / nobarrier (default: on)

Ensure that all IO write operations make it through the device cache and are stored permanently when the filesystem is at its consistency checkpoint. This typically means that a flush command is sent to the device that will synchronize all pending data and ordinary metadata blocks, then writes the superblock and issues another flush.

The write flushes incur a slight hit and also prevent the IO block scheduler from reordering requests in a more effective way. Disabling barriers gets rid of that penalty but will most certainly lead to a corrupted filesystem in case of a crash or power loss. The ordinary metadata blocks could be yet unwritten at the time the new superblock is stored permanently, expecting that the block pointers to metadata were stored permanently before.


I guess that makes sense, and it would be a very bad idea to set cache=writeback on the VM. cache=none ought to be fine, but in practice I don't think it will make much difference given that btrfs in the guest is doing explicit flushes.  In fact, it's quite comforting to know that the barrier stuff appears to be working.

Benjamin Rampe

Apr 23, 2018, 10:29:04 AM
to gan...@googlegroups.com
On 20/04/18 18:22, candlerb wrote:
> OK, that bit about emulating 512 byte sectors might be relevant (where did
> you find info about that model having 4KB physical sectors?)
I *lazily* trust the portal https://geizhals.de -- also used by heise,
aka. the "H", and its search field. UK [en]: https://skinflint.co.uk

Who needs dysfunctional vendor support sites ;)

candlerb

Apr 25, 2018, 4:35:05 AM
to ganeti
sysdig is my new best friend: a sort of combination of strace+top.  But so far, I've just been able to show that there is very little in the way of write() or writev() activity from any of the running applications.

I've also tried stopping netdata itself just to be sure - but netdata on the host still shows the significant amount of background disk activity. Mounting the btrfs filesystem with 'noatime,nodiratime' hasn't made a difference either (the default is "relatime" which ought to be fine anyway).

There is no swap inside this VM, but I have a suspicion that Samba is doing a bunch of writes via mmap().  Not sure how to pin that down though.

candlerb

Apr 25, 2018, 5:50:25 AM
to ganeti
FWIW, I think I got to the bottom of this.  Documenting here in case anyone is interested.

(1) There is very little write() or writev() activity going on.  Over 20 minutes:

# sysdig -c topfiles_bytes -r sysdig.dump3 evt.type=write or evt.type=writev
Bytes               Filename
--------------------------------------------------------------------------------
85.00KB             /dev/urandom
8.36KB              /usr/local/samba/private/named.conf.update.tmp
5.12KB              /var/log/syslog
5.02KB              /var/log/freeradius/radacct/192.168.7.71/auth-detail-20180425
2.11KB              /var/log/freeradius/radius.log
1.93KB              /var/log/freeradius/radacct/192.168.7.71/reply-detail-20180425
746B                /var/log/pdns.info
594B                /var/log/auth.log
252B                /var/log/cron
186B                /dev/null

(Aside: I can see that Samba 4.7.7 periodically opens /dev/urandom for write, and then writes 8 x 128 byte blocks to it.  Very odd, but not really significant)

(2) However, btrace shows a significant amount of disk writes from "samba" and "python" processes.

# btrace -w 60 -s /dev/vda
...

(Aside: "apt-get install blktrace --no-install-recommends", otherwise you get the whole kitchen sink including X11 and Mono.)

(3) Looking for execve in sysdig, the python processes are:
  • /usr/local/samba/sbin/samba_dnsupdate
  • /usr/local/samba/sbin/samba_spnupdate
  • /usr/local/samba/sbin/samba_kcc (replication topology)
(4) I didn't find an easy way to map block offsets into the btrfs files that contain those blocks.

However, sysdig tells me that a whole load of files under /usr/local/samba are being opened by mmap() for write, and indeed I can see them being touched frequently:

[inside container]
# find /usr/local/samba -type f -mmin -10 | xargs ls -ld
-rw------- 1 root root 10383360 Apr 25 09:06 /usr/local/samba/private/sam.ldb.d/CN=CONFIGURATION,DC=AD,DC=EXAMPLE,DC=COM.ldb
-rw------- 1 root root 10383360 Apr 25 09:06 /usr/local/samba/private/sam.ldb.d/CN=SCHEMA,CN=CONFIGURATION,DC=AD,DC=EXAMPLE,DC=COM.ldb
-rw------- 1 root root  6643712 Apr 25 09:07 /usr/local/samba/private/sam.ldb.d/DC=AD,DC=EXAMPLE,DC=COM.ldb
-rw------- 1 root root 15134720 Apr 25 09:07 /usr/local/samba/private/sam.ldb.d/DC=DOMAINDNSZONES,DC=AD,DC=EXAMPLE,DC=COM.ldb
-rw------- 1 root root  4247552 Apr 25 09:06 /usr/local/samba/private/sam.ldb.d/DC=FORESTDNSZONES,DC=AD,DC=EXAMPLE,DC=COM.ldb
-rw-r----- 1 root root   831488 Apr 25 09:07 /usr/local/samba/private/sam.ldb.d/metadata.tdb
-rw------- 1 root root    16384 Apr 25 09:07 /usr/local/samba/private/schannel_store.tdb
-rw-r--r-- 1 root root  1327104 Apr 25 09:07 /usr/local/samba/var/cache/gencache.tdb
-rw-r--r-- 1 root root   454656 Apr 25 09:06 /usr/local/samba/var/lock/brlock.tdb
-rw-r--r-- 1 root root   454656 Apr 25 09:07 /usr/local/samba/var/lock/gencache_notrans.tdb
-rw-r--r-- 1 root root    73728 Apr 25 09:06 /usr/local/samba/var/lock/leases.tdb
-rw-r--r-- 1 root root   507904 Apr 25 09:06 /usr/local/samba/var/lock/locking.tdb
-rw-r--r-- 1 root root       21 Apr 25 09:07 /usr/local/samba/var/lock/msg.lock/27210
-rw-r----- 1 root root     8192 Apr 25 09:07 /usr/local/samba/var/lock/names.tdb
-rw-r--r-- 1 root root    16384 Apr 25 09:07 /usr/local/samba/var/lock/serverid.tdb
-rw------- 1 root root     8888 Apr 25 09:07 /usr/local/samba/var/lock/smbXsrv_client_global.tdb
-rw------- 1 root root    36864 Apr 25 09:06 /usr/local/samba/var/lock/smbXsrv_open_global.tdb
-rw------- 1 root root    40960 Apr 25 09:07 /usr/local/samba/var/lock/smbXsrv_session_global.tdb
-rw------- 1 root root    24576 Apr 25 09:07 /usr/local/samba/var/lock/smbXsrv_tcon_global.tdb
-rw-r--r-- 1 root root    16384 Apr 25 09:07 /usr/local/samba/var/lock/smbd_cleanupd.tdb
-rw------- 1 root root    32768 Apr 25 08:59 /usr/local/samba/var/locks/winbindd_cache.tdb

So I think I just have to accept that samba4 (when running as a domain controller) is a chatty application which touches files on disk frequently.

I can also see there are quite a lot of msync() and fdatasync() calls:

# sysdig -r sysdig.dump3 -c topscalls evt.type contains sync
# Calls             Syscall
--------------------------------------------------------------------------------
5592                msync
5592                fdatasync

(this counts entry and exit separately, i.e. there were 2796 msync and 2796 fdatasync calls in about 20 minutes, but that's still about 2.3 of each per second on average)

It could perhaps be improved by using ZFS as the underlying filesystem, either inside the guest or on the VM host, as it would batch up writes. There is the risk of some fragmentation, which probably won't matter if the hot blocks stay in the block cache.

So I'm thinking of moving to using ganeti + local image files on ZFS datasets.  Unfortunately:

(1) ganeti only does raw files, not qcow2

(2) I want to store each image file inside its own ZFS dataset (so they can be separately replicated to a backup server), but ganeti won't automatically create the datasets for me.  I think the extstorage interface is for block-level devices only, and I don't want to use zvols - see http://www.openoid.net/psa-snapshots-are-better-than-zvols/.

Cheers,

Brian.

Phil Regnauld

Apr 25, 2018, 7:13:13 AM
to gan...@googlegroups.com
candlerb (b.candler) writes:
>
> It could perhaps be improved by using ZFS as the underlying filesystem,
> either inside the guest or on the VM host, as it would batch up writes.
> There is the risk of some fragmentation, which probably won't matter if the
> hot blocks stay in the block cache.
>
> So I'm thinking of moving to using ganeti + local image files on ZFS
> datasets. Unfortunately:
>
> (1) ganeti only does raw files, not qcow2
>
> (2) I want to store each image file inside its own ZFS dataset (so they can
> be separately replicated to a backup server), but ganeti won't
> automatically create the datasets for me. I think the extstorage interface
> is for block-level devices only, and I don't want to use zvols - see here
> <http://www.openoid.net/psa-snapshots-are-better-than-zvols/>.

Have you looked at https://github.com/brigriffin/ganeti-extstorage-zfs ?

I've set it up on a couple of test nodes, and it is pretty convincing.

I've gone through the same thought process as you - but didn't feel
like writing a new storage provider that did the dataset creation, etc.

For one setup, I ended up using ZFS+FreeBSD as an NFS server, and
using NFSv4 dynamic mounting of subdirs.

candlerb

Apr 26, 2018, 7:43:53 AM
to ganeti
>        Have you looked at https://github.com/brigriffin/ganeti-extstorage-zfs ? 

Only briefly - enough to see it was using zvols, and also that it does drbd replication on top of zvols. (But I guess I could ignore that and just use -t plain)

>        For one setup, I ended up using ZFS+FreeBSD as an NFS server, and 
>        using NFSv4 dynamic mounting of subdirs.

Sure, with -t sharedfile there are other options.