DRBD tuning for Ganeti


John N.

Jun 30, 2013, 3:29:22 PM
to gan...@googlegroups.com
Hi,

I am using Ganeti 2.6.2 on Debian with 4 SSD disks per node in a hardware RAID5 array with LVM and DRBD. Read throughput on my instances (Xen PVM + ganeti-instance-debootstrap) is great (~1020 MB/s), but write throughput is far behind, presumably due to DRBD (~60 MB/s). I would now like to get more than 60 MB/s write throughput and am looking for any other recommendations on how I can achieve that.

What I have already done to achieve the 60 MB/s is the following:

1) increase DRBD's sync rate to ~100 MB/s to use the full capacity of my Gigabit network (100 * 1024):

gnt-cluster modify -D drbd:c-max-rate=102400,resync-rate=102400

2) set the Gigabit network interface's MTU to 9000 (see the sketch after this list)

3) turn off various NIC offloading features:

ethtool --offload eth1  gso off tso off sg off gro off
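
For reference, the MTU change and the verification look roughly like this (eth1 being the replication NIC is an assumption; the persistent MTU setting belongs in /etc/network/interfaces):

ip link set dev eth1 mtu 9000
ip link show eth1                 # confirm the MTU took effect
ethtool --show-offload eth1       # confirm which offloads are now off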

In my understanding, with these parameters I should at best achieve 100 MB/s write throughput on my instances; unfortunately, it doesn't go over 60 MB/s, as mentioned. On the other hand, the DRBD sync (as seen in /proc/drbd) goes faster and really does use the full 100 MB/s available.

Any ideas what more I can do to get better write performance? Or is this the maximum one can get with my current setup of a Gigabit network, Xen PVM and DRBD?

Cheers,
John


Lucas, Sascha

Jul 1, 2013, 3:26:58 AM
to gan...@googlegroups.com
Hi John,

From: John N.
Date: Sun, 30. June 2013 21:29

> Read throughput on my instances (Xen PVM+ganeti instance debootstrap) is great (~1020 MB/sec)

How do you measure this? I don't know Xen, but with KVM you can measure nonsense when doing reads with host-side caching active (kvm:disk_cache=writethrough).

> but of course the write throughput is far behind due to DRBD I suppose (~60 MB/s).

That's somewhat bad. I get sequential writes of 112 MB/s with SAS disks, which is close to 1 Gbit.

> 1) increase DRBD's sync rate to ~100 MB/s to use the full capacity of my Gigabit network (100 * 1024):

The sync rate only affects DRBD's initial sync or a resync. It says nothing about how fast you can write (think of hardware RAID initialisation).

> On the other hand the DRBD sync (as seen in /proc/drbd) goes faster and really uses the 100 MB/s available.

So your network is good.

> Any ideas what I can do more to get a better write performance?

Use the noop or deadline I/O scheduler on the host _and_ the guest. DRBD's activity log produces "bad patterns" for CFQ.

Thanks, Sascha.



John N.

Jul 1, 2013, 3:52:15 AM
to gan...@googlegroups.com
Hi Sascha,


> Read throughput on my instances (Xen PVM+ganeti instance debootstrap) is great (~1020 MB/sec)

How do you measure this? I don't know Xen, but with KVM you can measure nonsense when doing reads with host-side caching active (kvm:disk_cache=writethrough).

To measure read throughput I used hdparm (hdparm -tT /dev/xvda). I also used bonnie++ afterwards to confirm the results. Results for read are the following:

hdparm: 1031.28 MB/sec
bonnie++: 1113753 K/sec Block Sequential Input
 
> but of course the write throughput is far behind due to DRBD I suppose (~60 MB/s).

That's somewhat bad. I get sequential writes of 112 MB/s with SAS disks, which is close to 1 Gbit.

Quite bad indeed, but now I suspect my method of measuring write throughput (dd) of not being reliable. Here is the dd command I used:

dd if=/dev/zero of=file bs=64k count=5000 oflag=direct

I just checked write throughput with bonnie++ and get 119262 K/sec Block Sequential Output, which is approximately the same as the 112 MB/s you mention and about the maximum one can get on a Gigabit network. So is dd maybe not so reliable for measuring write throughput?
 
> 1) increase DRBD's sync rate to ~100 MB/s to use the full capacity of my Gigabit network (100 * 1024):

The sync rate only affects DRBD's initial sync or a resync. It says nothing about how fast you can write (think of hardware RAID initialisation).

Ok thanks for confirming, I wasn't aware of that but somehow suspected it.
 
> Any ideas what I can do more to get a better write performance?

Use the noop or deadline I/O scheduler on the host _and_ the guest. DRBD's activity log produces "bad patterns" for CFQ.

Is there maybe any documentation on how to change these settings?

Best,
John

Lucas, Sascha

Jul 1, 2013, 4:51:30 AM
to gan...@googlegroups.com
Hi John,

From: John N.
Date: Mon, 1. July 2013 09:52

> hdparm: 1031.28 MB/sec
> bonnie++: 1113753 K/sec Block Sequential Input

This still sounds high (no caching effects? To be sure, run the benchmarks on the host). On a 4xSSD RAID10 I get:
# hdparm -tT /dev/cciss/c0d0p2

/dev/cciss/c0d0p2:
Timing cached reads: 19316 MB in 2.00 seconds = 9666.67 MB/sec
Timing buffered disk reads: 1410 MB in 3.00 seconds = 469.65 MB/sec

# dd if=/dev/cciss/c0d0p2 of=/dev/null bs=1M count=10024
10024+0 records in
10024+0 records out
10510925824 bytes (11 GB) copied, 21.3963 s, 491 MB/s
 
> but now I suspect my method of measuring write throughput of being not reliable (dd). Here is the dd command I used to measure write throughput:
> dd if=/dev/zero of=file bs=64k count=5000 oflag=direct

Perhaps bs is too small? That is more of a block size for random I/O. I do the same, but with bs=1M.
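
For example, something like this for a sequential write test (file name and size are arbitrary):

dd if=/dev/zero of=testfile bs=1M count=5000 oflag=direct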

> Is there maybe any documentation on how to change these settings?

https://www.kernel.org/doc/Documentation/block/switching-sched.txt
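
In short, a minimal sketch (device names are assumptions: /dev/sda on the node, /dev/xvda inside the instance; add elevator=deadline to the kernel command line to make it persistent):

echo deadline > /sys/block/sda/queue/scheduler     # on the node (dom0)
cat /sys/block/sda/queue/scheduler                 # active scheduler is shown in brackets
echo deadline > /sys/block/xvda/queue/scheduler    # inside the instance (domU)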

John N.

Jul 1, 2013, 8:49:14 AM
to gan...@googlegroups.com
Hi Sascha,


> hdparm: 1031.28 MB/sec
> bonnie++: 1113753 K/sec Block Sequential Input

This still sounds high (no caching effects? To be sure, run the benchmarks on the host). On a 4xSSD RAID10 I get:
# hdparm -tT /dev/cciss/c0d0p2

I can't rule out caching effects, as the RAID card (LSI MegaRAID) has a CacheVault with BBU and 1 GB of DRAM. But in my understanding it is expected that read throughput with 4x SSD in RAID5 is approximately twice as fast as with 4x SSD in RAID10.


/dev/cciss/c0d0p2:
 Timing cached reads:   19316 MB in  2.00 seconds = 9666.67 MB/sec
 Timing buffered disk reads:  1410 MB in  3.00 seconds = 469.65 MB/sec

# dd if=/dev/cciss/c0d0p2 of=/dev/null bs=1M count=10024
10024+0 records in
10024+0 records out
10510925824 bytes (11 GB) copied, 21.3963 s, 491 MB/s
 
> but now I suspect my method of measuring write throughput of being not reliable (dd). Here is the dd command I used to measure write throughput:
> dd if=/dev/zero of=file bs=64k count=5000 oflag=direct

Perhaps bs is too small? That is more of a block size for random I/O. I do the same, but with bs=1M.

I have now run the same tests as you did, directly on the host (Xen dom0); here are the results:

# hdparm -tT /dev/sda

/dev/sda:
 Timing cached reads:   13686 MB in  1.99 seconds = 6877.93 MB/sec
 Timing buffered disk reads: 3034 MB in  3.00 seconds = 1011.32 MB/sec

# dd if=/dev/sda of=/dev/null bs=1M count=10024
10024+0 records in
10024+0 records out
10510925824 bytes (11 GB) copied, 10.5887 s, 993 MB/s



> Is there maybe any documentation on how to change these settings?

https://www.kernel.org/doc/Documentation/block/switching-sched.txt

Thanks for the link, that's exactly what I was looking for. Now I suppose that if I am already at the limit of my write throughput, at ~110 MB/s on a Gigabit network, changing the I/O scheduler on my node and instances won't change anything?

Best,
John

Lucas, Sascha

Jul 1, 2013, 10:22:09 AM
to gan...@googlegroups.com
Hi John,

From: John N.
Date: Mon, 1. July 2013 14:49

> I can't guarantee any caching effects as the RAID card (LSI MegaRAID) has a CacheVault with BBU and 1GB of DRAM

I only mean cache from the OS side.

> But in my opinion and understanding it is correct that the read throughput should be approximately twice as fast with 4xSSD in RAID5 than with 4x SSD in RAID10.

Sounds right.

> Now I suppose that if I am already at the limit of my write throughput with ~110 MB/s using a Gigabit network, changing the IO scheduler on my node and instances won't change anything?

Yes and no. Yes, because you have SSDs, so DRBD's extra I/Os on the meta-data disk (badly ordered by the scheduler) won't hurt (as much). No, because you measured sequential I/O; most applications produce random I/O. To fill 1 Gbit of bandwidth with random I/O in a mixed read/write environment one would need tons of SSDs. My 4-disk RAID-10 reaches 12912 KB/s write bandwidth with the "fio iometer-file-access-server" benchmark; even 100 Mbit Ethernet would do nicely there. What counts here is IOPS, not bandwidth: I get 16725 IOPS read and 4188 IOPS write. With CFQ, write IOPS is more than 2 times lower (over DRBD).
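
For reference, a rough command-line equivalent of that fio job (the original job file ships with fio's examples; the directory, file size and libaio ioengine here are assumptions):

fio --name=fileserver --directory=/mnt/test --size=4g \
    --ioengine=libaio --direct=1 --iodepth=64 \
    --rw=randrw --rwmixread=80 \
    --bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10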

John N.

Jul 1, 2013, 12:05:12 PM
to gan...@googlegroups.com

> Now I suppose that if I am already at the limit of my write throughput with ~110 MB/s using a Gigabit network, changing the IO scheduler on my node and instances won't change anything?

Yes and no. Yes, because you have SSDs, so DRBD's extra I/Os on the meta-data disk (badly ordered by the scheduler) won't hurt (as much). No, because you measured sequential I/O; most applications produce random I/O. To fill 1 Gbit of bandwidth with random I/O in a mixed read/write environment one would need tons of SSDs. My 4-disk RAID-10 reaches 12912 KB/s write bandwidth with the "fio iometer-file-access-server" benchmark; even 100 Mbit Ethernet would do nicely there. What counts here is IOPS, not bandwidth: I get 16725 IOPS read and 4188 IOPS write. With CFQ, write IOPS is more than 2 times lower (over DRBD).

I get it, and yes, you are totally right: one often overlooks what really counts, in this case IOPS... Tomorrow I will try out the three different I/O schedulers (on the node as well as on the instance) and measure IOPS with the iometer fio job on the instance. It looks like noop might be the most appropriate scheduler for SSD drives.

Best,
J.

John N.

Jul 1, 2013, 5:13:08 PM
to gan...@googlegroups.com

Yes and no. Yes, because you have SSDs, so DRBD's extra I/Os on the meta-data disk (badly ordered by the scheduler) won't hurt (as much). No, because you measured sequential I/O; most applications produce random I/O. To fill 1 Gbit of bandwidth with random I/O in a mixed read/write environment one would need tons of SSDs. My 4-disk RAID-10 reaches 12912 KB/s write bandwidth with the "fio iometer-file-access-server" benchmark; even 100 Mbit Ethernet would do nicely there. What counts here is IOPS, not bandwidth: I get 16725 IOPS read and 4188 IOPS write. With CFQ, write IOPS is more than 2 times lower (over DRBD).


So I now also had a go at the "fio iometer-file-access-server" benchmark and ran it 3 times on an instance, each time with a different I/O scheduler (CFQ, noop and deadline). Here are my IOPS results (read/write):

CFQ: 15236 / 3800
noop: 15909 / 3967
deadline: 16060 / 4001

To my surprise there are hardly any differences between them, so my first suspicion is that I have forgotten something... I changed the I/O scheduler on my node (Xen dom0) on /dev/sda, which holds my RAID5 array used with LVM and DRBD. Then I also changed the I/O scheduler on my instance (Xen PVM domU) on /dev/xvda. Is there anything else I need to do to switch the I/O scheduler? At least my results are consistent with yours, but I was expecting to see a bigger difference among the various schedulers.

Best,
John

Lucas, Sascha

Jul 4, 2013, 2:25:36 AM
to gan...@googlegroups.com
Hi John,

sorry for the delayed answer...

From: John N.
Date: Mon, 1. July 2013 23:13

> To my surprise there are no differences between each of them

Yes, this surprises me too. May I assume you also measured on the node (dom0) without DRBD and got the same IOPS?

In the past I've seen such differences depend on the filesystem, e.g. CFQ has no effect on Debian squeeze with ext3, but it does on reiserfs. It is said that every filesystem has its own optimization for rotating media and can therefore influence the block-layer scheduler. Maybe different kernel versions / filesystems behave differently (are more or less SSD-aware)?

My numbers / insights are from SLES 11 / ext3 (kernel 2.6.32/3.0).

> Is there anything else I need to do to switch IO scheduler?

No that's all.

John N.

Jul 5, 2013, 6:59:20 PM
to gan...@googlegroups.com

On Thursday, July 4, 2013 8:25:36 AM UTC+2, sascha wrote:
Hi John,

sorry for the delayed answer...

Hi Sascha,

No problem, I was also quite busy these days.
 
Yes, this surprises me too. May I assume you also measured on the node (dom0) without DRBD and got the same IOPS?

Actually no, I hadn't, but I just ran some tests: I created a 10 GB logical volume directly on the dom0 node for test purposes and formatted it with ext4 (sketch of the setup below). On that partition I ran the exact same fio fileserver script with the 3 different I/O schedulers. Below are the results:

CFQ: 21052 / 5277
deadline: 21887 / 5472
noop: 22170 / 5539

So again, as on an instance, there are hardly any differences between the 3 schedulers. What I do notice from this test, though, is that I get around 25-30% fewer IOPS on an instance with the DRBD disk template than on the node. I suppose this performance drop is not only DRBD's "fault" but also due to the instance being a VM. I still have to test on an instance using the plain disk template instead of DRBD.
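
For reference, a sketch of that test setup (assuming Ganeti's default volume group name xenvg):

lvcreate -L 10G -n fiotest xenvg
mkfs.ext4 /dev/xenvg/fiotest
mkdir -p /mnt/fiotest && mount /dev/xenvg/fiotest /mnt/fiotest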

In the past I've seen such differences depend on the filesystem, e.g. CFQ has no effect on Debian squeeze with ext3, but it does on reiserfs. It is said that every filesystem has its own optimization for rotating media and can therefore influence the block-layer scheduler. Maybe different kernel versions / filesystems behave differently (are more or less SSD-aware)?

Interesting, and possible, yes... My tests were all done on Debian 7 with a 3.2 kernel and ext4 partitions.

Best,
John

John N.

Jul 5, 2013, 7:12:46 PM
to gan...@googlegroups.com

So again, as on an instance, there are hardly any differences between the 3 schedulers. What I do notice from this test, though, is that I get around 25-30% fewer IOPS on an instance with the DRBD disk template than on the node. I suppose this performance drop is not only DRBD's "fault" but also due to the instance being a VM. I still have to test on an instance using the plain disk template instead of DRBD.

Small addition here: I just ran the same tests on an instance with a plain disk template, and the fio IOPS results are the same as on the node itself, meaning that the IOPS loss (around 25-30% in my case) is due to DRBD alone.

J.

Lucas, Sascha

Jul 9, 2013, 5:02:45 AM
to gan...@googlegroups.com
Hi John,

From: John N.
Date: Sat, 6. July 2013 01:13

> meaning that the performance loss in IOPS (around 25-30% in my case) is only due to DRBD.

Yes. Virtualisation does not influence this kind of I/O.

However, you remember my 4xSSD RAID-10. I managed to have no performance drop over DRBD compared to a "plain" (LVM) instance, with the following settings:

* 2 Nodes equal in I/O performance
* Ethernet latency small enough (<= 1/IOPS-write); in practice you can't do more than ~10k IOPS over a single connection
* Ganeti DRBD disk parameters: disk-barriers=bf, meta-barriers=True, net-custom="--max-buffers 8000 --max-epoch-size 8000"
* deadline I/O scheduler

WRT DRBD tuning [1], disk-barriers=bf is safe on RAID+BBU. I'm unsure about disabling meta-data barriers; DRBD.CONF(5) doesn't make it clear enough for me. Playing around with unplug-watermark had no effect for me.

Even if my RAID isn't as fast as yours, my general finding was that there need not be any performance loss with DRBD.

[1] http://www.drbd.org/users-guide-8.3/s-throughput-tuning.html

John N.

Jul 9, 2013, 9:10:27 AM
to gan...@googlegroups.com

On Tuesday, July 9, 2013 11:02:45 AM UTC+2, sascha wrote:
Hi John,

Yes. Virtualisation does not influence this kind of I/O.

Hi Sascha,

Great, I also found your other post about DRBD where you mention having achieved the same performance on your instance as on your node, well done!
 
However, you remember my 4xSSD RAID-10. I managed to have no performance drop over DRBD compared to a "plain" (LVM) instance, with the following settings:

* 2 Nodes equal in I/O performance

Check


* Ethernet latency small enough (<= 1/IOPS-write); in practice you can't do more than ~10k IOPS over a single connection

Check, I bundle two Gigabit NICs in an EtherChannel (802.3ad) on a Cisco switch.
 
* Ganeti DRBD disk parameters: disk-barriers=bf, meta-barriers=True, net-custom="--max-buffers 8000 --max-epoch-size 8000"

I will not touch the disk-barriers option, to be on the safe side, but I would like to modify the max-buffers and max-epoch-size parameters. Somehow I can't find the right parameter format to do that with gnt-cluster:

# gnt-cluster modify -D drbd:net-custom="max-buffers=8000,max-epoch-size=8000"
Parameter Error: Unknown parameter 'max-epoch-size'

I also tried different formats, but in vain... Any idea how to pass these options to gnt-cluster modify?
 
* deadline I/O scheduler

Currently I am still using CFQ, as in my case there is hardly any performance difference between the three schedulers (see above for reference).
 
WRT DRBD tuning [1], disk-barriers=bf is safe on RAID+BBU. I'm unsure about disabling meta-data barriers; DRBD.CONF(5) doesn't make it clear enough for me. Playing around with unplug-watermark had no effect for me.

I will not risk disabling disk barriers even though I have a UPS and RAID+BBU, simply to cover the case of a hard crash (kernel panic or the like).

 
Even if my RAID isn't as fast as yours, my general finding was that there need not be any performance loss with DRBD.

Trying to achieve that too and nearly there ;-)

Cheers,
J.

Lucas, Sascha

Jul 10, 2013, 2:51:29 AM
to gan...@googlegroups.com
Hi John,

From: John N.
Date: Tue, 9. July 2013 15:10

> Check, I bundle two Gigabit NICs in an EtherChannel (802.3ad) on a Cisco switch.

That's another story... Hopefully you have configured xmit_hash_policy="layer3+4" and the equivalent on the switch side. A single TCP connection can't get more than 1 Gbit, but two can: LACP places the second connection (statistically) on the second NIC. Think of 2 DRBD disks for an instance, each disk having its own TCP connection. Inside the instance you stripe (LVM or software RAID) over these two disks. Now you can get 2x 1 Gbit sequential I/O and 2x ~10k IOPS of random write I/O.
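
For reference, a minimal Debian bonding sketch with that hash policy (interface names and addresses are only placeholders; assumes the ifenslave package with ifupdown):

auto bond0
iface bond0 inet static
    address 10.0.0.11
    netmask 255.255.255.0
    bond-slaves eth1 eth2
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer3+4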
 
> I will not touch the the disk-barriers option to be on the safe side

As mentioned on the DRBD tuning page and in the drbd.conf man page, disabling barriers and flushes is absolutely safe with a BBU. All of this only determines how write ordering is done. Barriers do not work over LVM with old kernels (like mine), and disabling flushes does not disable write ordering; the drain method is used instead.

What I'm not sure about is disabling meta flushes (should be asked on DRBD list).

> Somehow I can't find the right parameter format to do that with gnt-cluster:

$ gnt-cluster modify --disk-parameters drbd:disk-barriers=bf,meta-barriers=true,net-custom='--max-buffers 8000 --max-epoch-size 8000'
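
To check that the options actually reached DRBD, you can dump the running configuration of one of the instance's DRBD devices (assuming DRBD 8.3.x and minor 0 here):

drbdsetup /dev/drbd0 show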

John N.

Jul 10, 2013, 5:28:40 PM
to gan...@googlegroups.com
On Wednesday, July 10, 2013 8:51:29 AM UTC+2, sascha wrote:
> Check, I bundle two Gigabit NICs in an EtherChannel (802.3ad) on a Cisco switch.

That's another story... Hopefully you have configured xmit_hash_policy="layer3+4" and the equivalent on the switch side. A single TCP connection can't get more than 1 Gbit, but two can: LACP places the second connection (statistically) on the second NIC. Think of 2 DRBD disks for an instance, each disk having its own TCP connection. Inside the instance you stripe (LVM or software RAID) over these two disks. Now you can get 2x 1 Gbit sequential I/O and 2x ~10k IOPS of random write I/O.

Well, I was going to use the layer3+4 xmit_hash_policy, but I read in the bonding documentation that it is not fully 802.3ad compliant and not recommended, so I stuck with layer2+3. Are you using layer3+4? If yes, do you notice any problems? Also, if you have a Cisco switch behind it, what load-balancing would you choose for the port-channel to be compatible with layer3+4? On my switch I have the following load-balancing options available:

  dst-ip       Dst IP Addr
  dst-mac      Dst Mac Addr
  src-dst-ip   Src XOR Dst IP Addr
  src-dst-mac  Src XOR Dst Mac Addr
  src-ip       Src IP Addr
  src-mac      Src Mac Addr

and I currently have chosen "src-dst-ip" to go along with layer2+3 on the Linux side.
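
For reference, on the switch side this is a global IOS setting, e.g. (sketch):

(config)# port-channel load-balance src-dst-ip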
 
As mentioned on the DRBD tuning page and in the drbd.conf man page, disabling barriers and flushes is absolutely safe with a BBU. All of this only determines how write ordering is done. Barriers do not work over LVM with old kernels (like mine), and disabling flushes does not disable write ordering; the drain method is used instead.

I re-read those specific parts of the DRBD documentation, and you are right: the only critical case they mention is running without a BBU.
 
What I'm not sure about is disabling meta flushes (should be asked on DRBD list).  

With reference to the DRBD documentation (http://www.drbd.org/users-guide/s-disable-flushes.html) it seems that the same rules apply for meta data flushes as for disk flushes.
 
> Somehow I can't find the right parameter format to do that with gnt-cluster:

$ gnt-cluster modify --disk-parameters drbd:disk-barriers=bf,meta-barriers=true,net-custom='--max-buffers 8000 --max-epoch-size 8000'

Thanks, that was it. I tried out these exact options and ran fio (iometer-file-access-server), and believe it or not, the results are horrible: 2041 IOPS read and 509 IOPS write. I really can't explain that. I will now have to find out which option is responsible for this disastrous performance.

Best,
John


Lucas, Sascha

Jul 11, 2013, 2:36:46 AM
to gan...@googlegroups.com
Hi John,

From: John N.
Date: Wed, 10. July 2013 23:29

> Are you using layer3+4?

Yes, I've been using it for years now. I'm aware of the compliance warning, but all switches I've seen so far have such a balancing policy, and it works. My interpretation is that the balancing algorithm is not part of LACP (it is not negotiated), so every sender balances with its own algorithm (hash method) and the receiver has to accept this. What counts is that every packet of a "session" (normal IP/UDP+TCP traffic) flows over the same path, i.e. no out-of-order packets arriving faster over another path within the same session/connection.
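
You can verify what the Linux side actually uses at runtime (bond0 is an assumption):

grep -i -e 'bonding mode' -e 'hash policy' /proc/net/bonding/bond0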

> also if you have a Cisco switch behind what load-balancing would you chose

ATM I have a Cisco behind it (Nexus 5000), and it has an additional algorithm to choose: source-dest-port (Layer 4).

> With reference to the DRBD documentation (http://www.drbd.org/users-guide/s-disable-flushes.html) it seems that the same rules apply for meta data flushes as for disk flushes.

Thanks, this is the first document I've seen mentioning disk- and meta-flushes wrt. BBU at the same time.
 
> the results are horrible: 2041 IOPS read and 509 IOPS write.

Wow, I'm surprised. One more piece of evidence that most things need interpretation and don't apply universally. I assume you get the same results with CFQ/deadline/noop?

John N.

Jul 11, 2013, 2:59:32 AM
to gan...@googlegroups.com

On Thursday, July 11, 2013 8:36:46 AM UTC+2, sascha wrote:

Hi Sascha
 
> Are you using layer3+4?

Yes, I've been using it for years now. I'm aware of the compliance warning, but all switches I've seen so far have such a balancing policy, and it works. My interpretation is that the balancing algorithm is not part of LACP (it is not negotiated), so every sender balances with its own algorithm (hash method) and the receiver has to accept this. What counts is that every packet of a "session" (normal IP/UDP+TCP traffic) flows over the same path, i.e. no out-of-order packets arriving faster over another path within the same session/connection.

Great, it's nice to have feedback from someone who has used layer3+4 (in production, I assume) for all these years without any problems. I will definitely test and change my xmit_hash_policy asap.

ATM I have a Cisco behind it (Nexus 5000), and it has an additional algorithm to choose: source-dest-port (Layer 4).

Lucky you, I wish they would also add this to IOS for the lower-end switch models... Nexus switches are still horribly expensive.
 
> the results are horrible: 2041 IOPS read and 509 IOPS write.

Wow, I'm surprised. One more piece of evidence that most things need interpretation and don't apply universally. I assume you get the same results with CFQ/deadline/noop?

I only tried 2 schedulers: CFQ and deadline both gave pretty much the same awful results, so I didn't bother with noop. I haven't had time to test more yet. But yes, it indeed depends on many factors, and certainly also on the hardware used (RAID card, NIC, etc.). So it's really a "game" of testing many different parameters against a specific set of hardware...

Regards,
John

John N.

Jul 11, 2013, 12:31:10 PM
to gan...@googlegroups.com
 
> the results are horrible: 2041 IOPS read and 509 IOPS write.

Wow, I'm surprised. One more piece of evidence that most things need interpretation and don't apply universally. I assume you get the same results with CFQ/deadline/noop?

So I have now found out which option leads to such bad performance: meta-barriers. As soon as I set this option to true on my Ganeti cluster, performance drops massively. No clue why, but I will not touch this option anymore...
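
For reference, a sketch of the command to turn just that option back off while keeping the rest (same format as before):

gnt-cluster modify -D drbd:disk-barriers=bf,meta-barriers=false,net-custom='--max-buffers 8000 --max-epoch-size 8000'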

Best,
J.
 