> Read throughput on my instances (Xen PVM+ganeti instance debootstrap) is great (~1020 MB/sec)
How do you measure this? I don't know Xen, but with KVM you can measure nonsense when doing reads if host-side caching is active (kvm:disk_cache=writethrough).
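To rule out host-side caching for the test, you could switch the instance's cache mode off; with Ganeti's KVM hypervisor that would be something like (instance name assumed):

$ gnt-instance modify -H disk_cache=none instance1
$ gnt-instance reboot instance1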
> but of course the write throughput is far behind due to DRBD I suppose (~60 MB/s).
That's somewhat poor. I get sequential writes of 112 MB/s with SAS disks, which is close to the 1 Gbit limit.
> 1) increase DRBD's sync rate to ~100 MB/s to use the full capacity of my Gigabit network (100 * 1024):
The sync rate only affects DRBD's initial sync and resyncs. It says nothing about how fast you can write (think of HW RAID initialisation).
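For reference, this is the knob in question in drbd.conf (8.3-style syntax, value taken from your example); it only throttles the background (re)synchronisation, never application writes:

syncer {
    rate 100M;  # resync bandwidth only
}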
> Any ideas what I can do more to get a better write performance?
Use the noop or deadline I/O scheduler on the host _and_ the guest. DRBD's activity log produces "bad patterns" for CFQ.
> hdparm: 1031.28 MB/sec
> bonnie++: 1113753 K/sec Block Sequential Input
This still sounds high (no caching effects? To be sure, run the benchmarks on the host). On a 4xSSD RAID-10 I get:
# hdparm -tT /dev/cciss/c0d0p2
/dev/cciss/c0d0p2:
Timing cached reads: 19316 MB in 2.00 seconds = 9666.67 MB/sec
Timing buffered disk reads: 1410 MB in 3.00 seconds = 469.65 MB/sec
# dd if=/dev/cciss/c0d0p2 of=/dev/null bs=1M count=10024
10024+0 records in
10024+0 records out
10510925824 bytes (11 GB) copied, 21.3963 s, 491 MB/s
> but now I suspect my method of measuring write throughput of being not reliable (dd). Here is the dd command I used to measure write throughput:
> dd if=/dev/zero of=file bs=64k count=5000 oflag=direct
Perhaps bs is too small? That is more a block size for random I/O. I do the same but with bs=1M.
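For example, keeping direct I/O but with a larger block size (count chosen so a few GB are written):

dd if=/dev/zero of=file bs=1M count=5000 oflag=direct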
> Is there maybe any documentation on how to change these settings?
https://www.kernel.org/doc/Documentation/block/switching-sched.txt
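In short, assuming the disk is sda (do this as root, on the host and in the guest):

# cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
# echo deadline > /sys/block/sda/queue/scheduler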
> Now I suppose that if I am already at the limit of my write throughput with ~110 MB/s using a Gigabit network, changing the IO scheduler on my node and instances won't change anything?
Yes and no. Yes, because you have SSDs, so DRBD's extra I/Os on the meta-data disk (wrongly ordered by the scheduler) won't hurt (so much). No, because you measured sequential I/O, while most applications produce random I/O. To fill 1 Gbit of bandwidth with random I/O in a mixed read/write environment you would need tons of SSDs: my 4-disk RAID-10 manages 12912 KB/s write bandwidth with the "fio iometer-file-access-server" benchmark, so even 100 Mbit Ethernet would do nicely. What counts here is IOPS, not bandwidth. I get 16725 IOPS read and 4188 IOPS write; with CFQ, write IOPS is more than 2 times lower (over DRBD).
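For anyone who wants to reproduce the numbers: that job file ships with fio (examples/iometer-file-access-server.fio). A reduced sketch of what it does, with option values from memory, looks like this:

[global]
description=Mimic the IOMeter file-server access pattern
direct=1
ioengine=libaio
iodepth=64

[iometer]
rw=randrw
rwmixread=80
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
size=4g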
Hi John,
sorry for the delayed answer...
Yes, this surprises me too. May I assume you also measured on the node (dom0) without DRBD and got the same IOPS?
In the past I've seen such differences between filesystems, e.g. CFQ had no effect on Debian Squeeze with ext3, but it did with ReiserFS. It is said that every $FS has its own optimizations for rotating media and can therefore influence the block-layer scheduler. Maybe different kernel versions / filesystems behave differently here (are more or less SSD-aware)?
So, again, just as on an instance, there are hardly any differences between the 3 schedulers. What I do notice through this test is that I get around 25-30% fewer IOPS on an instance with the DRBD disk template than on the node. I suppose this performance drop is not only DRBD's "fault" but also down to the instance being a VM. I still have to test an instance using the plain disk template instead of DRBD.
Hi John,
Yes. Virtualisation does not influence this kind of I/O.
However, remember my 4xSSD RAID-10: I managed to get no performance drop over DRBD compared to a "plain" (LVM) instance with the following settings:
* 2 Nodes equal in I/O performance
* Ethernet latency small enough (<= 1/IOPS-write, i.e. ~0.1 ms round trip for 10k write IOPS); in practice you can't get more than ~10k IOPS over a single connection
* Ganeti DRBD disk parameters: disk-barriers=bf, meta-barriers=True, net-custom="--max-buffers 8000 --max-epoch-size 8000"
* deadline I/O scheduler
WRT DRBD tuning[1], disk-barriers=bf is safe on RAID+BBU. I'm unsure about disabling meta-data barriers; DRBD.CONF(5) doesn't make it clear enough for me. Playing around with unplug-watermark had no effect for me.
Even if my RAID isn't as performant as yours, my general finding was that there need be no performance loss with DRBD.
> Check, I bundle two Gigabit NICs in an EtherChannel (802.3ad) on a Cisco switch.
That's another story... Hopefully you have configured xmit_hash_policy=layer3+4 and the equivalent on the switch side. A single TCP connection can't get more than 1 Gbit, but two can: LACP places the second connection (statistically) on the second NIC. Think of 2 DRBD disks for an instance, each disk having its own TCP connection. Inside the instance you stripe (LVM or software RAID) over these two disks. Now you can get 2x1 Gbit sequential I/O and 2x ~10k IOPS write random I/O.
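For reference, a minimal sketch of the Linux side of such a bond (Debian-style /etc/network/interfaces; interface names and address are assumptions):

auto bond0
iface bond0 inet static
    address 192.168.1.10
    netmask 255.255.255.0
    bond-slaves eth0 eth1
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer3+4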
As mentioned on the DRBD tuning site and in the drbd.conf man page, disabling barriers and flushes is absolutely safe with a BBU. All these options control is how write ordering is done. Barriers do not work over LVM with old kernels (like mine), and disabling flushes does not disable write ordering; the drain method is used instead.
What I'm not sure about is disabling meta-data flushes (that should be asked on the DRBD list).
> Somehow I can't find the right parameter format to do that with gnt-cluster:
$ gnt-cluster modify --disk-parameters drbd:disk-barriers=bf,meta-barriers=true,net-custom='--max-buffers 8000 --max-epoch-size 8000'
> Are you using layer3+4?
Yes, I've been using it for years now. I'm aware of the compliance warning, but every switch I've seen so far offers such a balancing policy, and it works. My interpretation is that the balancing algorithm is not part of LACP (it is not negotiated), so every sender balances with its own algorithm (hash method) and the receiver has to accept it. What counts is that every packet of a "session" (normal traffic with IP/UDP+TCP) flows over the same path, i.e. no out-of-order packets arriving faster over another path within the same session/connection.
ATM I have a Cisco behind it (a Nexus 5000), which offers an additional algorithm to choose: source-dest-port (layer 4).
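On the Nexus that would be something like the following (syntax from memory, so treat it as a sketch and check it against your NX-OS version):

switch(config)# port-channel load-balance ethernet source-dest-port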
> the results are horrible: 2041 IOPS read and 509 IOPS write.
Wow, I'm surprised. One more piece of evidence that most things need interpretation and don't apply universally. I assume you get the same results with CFQ, deadline and noop?