After almost two weeks of experimentation, google searches and reading of posts, bug reports and discussions I'm still far from an answer. I'm hoping someone on this list could shed some light on the subject.
I'm using a 3Ware 9500S-12 card and am able to produce up to 400MB/s sustained read from my 12-disk 4.1TB RAID5 SATA array, 128MB cache onboard, ext3 formatted. All is well when performing a single read -- it works nice and fast.
The system is a web server, serving mid-size files (50MB each, on average). All hell breaks loose when doing many concurrent reads: with anywhere between 200 and 400 concurrent streams, things simply grind to a halt and the system transfers a maximum of 12-14MB/s.
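To put those numbers side by side, here is a quick back-of-the-envelope calculation. The figures are the ones quoted above; the even split of bandwidth between streams is only an assumption made for illustration.

```python
# Figures taken from the post; the even split between streams is an assumption.
SEQUENTIAL_MB_S = 400        # single sequential read
CONCURRENT_MB_S = (12, 14)   # aggregate throughput under load
STREAMS = (200, 400)         # concurrent readers


def per_stream_kb_s(aggregate_mb_s, streams):
    """Average throughput per stream in KB/s, assuming an even split (1 MB = 1024 KB)."""
    return aggregate_mb_s * 1024 / streams


worst = per_stream_kb_s(CONCURRENT_MB_S[0], STREAMS[1])  # 12 MB/s over 400 streams
best = per_stream_kb_s(CONCURRENT_MB_S[1], STREAMS[0])   # 14 MB/s over 200 streams
fraction = CONCURRENT_MB_S[1] / SEQUENTIAL_MB_S          # best case vs. sequential

print(f"per stream: {worst:.1f} to {best:.1f} KB/s")
print(f"aggregate is at most {fraction:.1%} of the sequential rate")
```

In other words, each client would see a few tens of KB/s, and the array as a whole delivers only a small fraction of what it manages for a single reader.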
I'm in the process of clearing up the array (this will take some time) and restructuring it to JBOD mode in order to use each disk individually. I will use a filesystem more suitable to streaming large files, such as XFS. But this will take time, and I would very much appreciate the advice of people in the know on whether this is going to help at all. It's hard for me to do extreme experimentation (deleting, formatting, reformatting) as this is a production system with many files that I have no other place to dump until they can be safely removed, though I'm working on dumping them slowly to other, remote, machines.
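For what it's worth, one way the "each disk individually" layout could be driven from the web server side is a deterministic name-to-disk mapping. This is only a sketch: the mount pattern `/data/diskNN` and the use of CRC32 are assumptions for illustration, not something that exists on this system.

```python
import zlib


def disk_for(filename, n_disks=12):
    """Map a file name to a disk index in [0, n_disks) using CRC32 of the name."""
    return zlib.crc32(filename.encode("utf-8")) % n_disks


def path_for(filename, n_disks=12, mount_pattern="/data/disk{:02d}"):
    """Build the full path for a file; the mount pattern is hypothetical."""
    return f"{mount_pattern.format(disk_for(filename, n_disks))}/{filename}"


if __name__ == "__main__":
    for name in ("video-001.avi", "video-002.avi", "archive.tar"):
        print(name, "->", path_for(name))
```

The mapping is stable, so the server can recompute where a file lives without a lookup table. It does not balance by popularity; a handful of hot files could still land on the same spindle.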
I'm running the latest kernel, 2.6.13.2, and the latest 3Ware driver, taken from the 3ware.com web site, which upon insmod also updates the card's firmware to the latest version.
In my experiments, I've tried using larger readahead, currently at 16k (this helps, higher values do not seem to help much), using the deadline scheduler for this device, booting the system with the 'noapic' option and playing with a bunch of VM tunable parameters which I'm not sure that I should really be touching. At the moment only the readahead modification is used as the other stuff simply didn't help at all.
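To make it easier to report exactly which settings are in effect, a small script along these lines could read them back from sysfs. It is a sketch under assumptions: it expects the kernel to expose `read_ahead_kb` and `scheduler` under `/sys/block/<device>/queue/`, and the device name `sda` is a placeholder.

```python
from pathlib import Path


def active_scheduler(text):
    """Return the name inside [brackets] from a sysfs scheduler line, or None."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def queue_settings(device, sysfs_root="/sys/block"):
    """Read readahead (in KB) and the active I/O scheduler for one block device."""
    queue = Path(sysfs_root) / device / "queue"
    return {
        "read_ahead_kb": int((queue / "read_ahead_kb").read_text()),
        "scheduler": active_scheduler((queue / "scheduler").read_text()),
    }


if __name__ == "__main__":
    # "sda" is a placeholder; substitute the device backing the array.
    print(queue_settings("sda"))
```

Posting this output next to each benchmark run removes any doubt about units or about which scheduler was actually selected at the time.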
With the stock kernel shipped with my distribution, 2.6.8, and its old 3ware driver, things were just as bad but manifested themselves differently. The system was visibly (top, vmstat...) spending most of its time in io-wait and load average was extremely high, in the area of 10 to 20. With the recent kernel and driver mentioned above, the excessive io-wait and load seem to have been resolved and observed loadavg is between 1 and 4.
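The io-wait share that top and vmstat show can also be computed directly from two snapshots of `/proc/stat`, which is handy for logging over time. The sketch below assumes the 2.6 field order on the aggregate `cpu` line (user, nice, system, idle, iowait, irq, softirq); the sample values are made up.

```python
def cpu_fields(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(x) for x in parts[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two /proc/stat snapshots."""
    a, b = cpu_fields(before), cpu_fields(after)
    deltas = [y - x for x, y in zip(a, b)]
    # Only sum through the 8th field; later fields on newer kernels (guest time)
    # are already included in 'user' and would be double counted.
    total = sum(deltas[:8])
    return 100.0 * deltas[4] / total if total else 0.0


# Hypothetical snapshots, for illustration only.
BEFORE = "cpu 100 0 50 800 50 0 0\ncpu0 50 0 25 400 25 0 0\n"
AFTER = "cpu 200 0 100 1000 700 0 0\ncpu0 100 0 50 500 350 0 0\n"

if __name__ == "__main__":
    print(f"iowait: {iowait_percent(BEFORE, AFTER):.1f}%")
```

On a live system the two snapshots would be reads of `/proc/stat` taken a few seconds apart.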
I don't have much experience with systems that are supposed to stream many files concurrently off a hardware RAID of this configuration, but my gut feeling is that something is very wrong and I should be seeing a much higher read throughput.
Trying to preempt people's questions, I've tried to include as much information as possible; a lot of stuff is pasted below.
I've just seen that the 3ware driver shares the same IRQ with my ethernet card, which has got me a little worried. Should I be?
# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Xeon(TM) CPU 3.00GHz
stepping : 1
cpu MHz : 2993.035
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm pni monitor ds_cpl cid cx16 xtpr
bogomips : 5993.68

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Xeon(TM) CPU 3.00GHz
stepping : 1
cpu MHz : 2993.035
cache size : 1024 KB
physical id : 3
siblings : 2
core id : 3
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm pni monitor ds_cpl cid cx16 xtpr
bogomips : 5985.52

# lspci
0000:00:00.0 Host bridge: Intel Corporation E7520 Memory Controller Hub (rev 0c)
0000:00:00.1 ff00: Intel Corporation E7525/E7520 Error Reporting Registers (rev 0c)
0000:00:01.0 System peripheral: Intel Corporation E7520 DMA Controller (rev 0c)
0000:00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev 0c)
0000:00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 0c)
0000:00:05.0 PCI bridge: Intel Corporation E7520 PCI Express Port B1 (rev 0c)
0000:00:06.0 PCI bridge: Intel Corporation E7520 PCI Express Port C (rev 0c)
0000:00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02)
0000:00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02)
0000:00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 (rev 02)
0000:00:1d.7 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02)
0000:00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2)
0000:00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
0000:00:1f.1 IDE interface: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02)
0000:00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev 02)
0000:00:1f.3 SMBus: Intel Corporation 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02)
0000:01:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 09)
0000:01:00.1 PIC: Intel Corporation 6700/6702PXH I/OxAPIC Interrupt Controller A (rev 09)
0000:01:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09)
0000:01:00.3 PIC: Intel Corporation 6700PXH I/OxAPIC Interrupt Controller B (rev 09)
0000:03:01.0 RAID bus controller: 3ware Inc 9xxx-series SATA-RAID
0000:05:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8050 Gigabit Ethernet Controller (rev 18)
0000:07:04.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05)
0000:07:0c.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
As you can see, this is a fairly recent motherboard that's supposed to perform well, though I don't know the manufacturer of this board as the machine is hosted and I don't have physical access, though I can ask them if anyone would like to know.
The ethernet card actually being used is the Intel E1000 with NAPI support compiled.
If there's any bit of information that's missing, please let me know and I'd be happy to provide it quickly.
If you can suggest a better (non Netapp, EMC, etc) solution that is somehow affordable and can provide very high read throughput, please let me know. I'm very interested in solutions that can saturate multiple gigabit links (of course, using more than one machine ;)
Please CC me on any replies as I'm not subscribed to the list.
Why are you booting with 'noapic'? In my experience that will seriously impact interrupt performance. Use the APIC if you've got it, which in this case you definitely do.
Yes, having your gigabit NIC and RAID controller on the same IRQ (in PIC mode) could definitely be a source of trouble.
In your web server testing, were you using an external traffic generator or an on-host process? If you try on-host (eliminating the network throughput and related interrupts) does performance improve?
So two biggest suggestions:
- Use the APIC. It is your friend.
- It looks like the 3ware card and gigabit nic are on different busses, but the pirq lines are being routed to the same legacy interrupt in PIC mode. So APIC mode should avoid that problem. If the controller and nic are actually on the same bus, separate them.
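A quick way to confirm whether the two devices still share a line after switching modes is to scan `/proc/interrupts`. The sketch below assumes the 2.6-era layout (one count column per CPU, then the controller type, then a comma-separated device list); the sample text is invented and is not output from the machine in question.

```python
def shared_irqs(interrupts_text):
    """Return {irq: [devices]} for numeric IRQ lines that list more than one device."""
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        irq, sep, rest = line.partition(":")
        if not sep or not irq.strip().isdigit():
            continue  # skip NMI, LOC, ERR and similar rows
        # Split off the per-CPU counts and the controller type; keep the rest whole.
        fields = rest.split(None, n_cpus + 1)
        if len(fields) < n_cpus + 2:
            continue
        devices = [d.strip() for d in fields[-1].split(",") if d.strip()]
        if len(devices) > 1:
            shared[int(irq)] = devices
    return shared


# Hypothetical PIC-mode sample, for illustration only.
SAMPLE = """\
           CPU0       CPU1
  0:    5012345          0          XT-PIC  timer
 11:     987654          0          XT-PIC  3w-9xxx, eth0
 14:      12345          0          XT-PIC  ide0
NMI:          0          0
"""

if __name__ == "__main__":
    print(shared_irqs(SAMPLE))
```

On the real system, pass in the contents of `/proc/interrupts` before and after removing 'noapic' and compare the results.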
Regards, Ian Morgan
-- ------------------------------------------------------------------- Ian E. Morgan Vice President & C.O.O. Webcon, Inc. imorgan at webcon dot ca PGP: #2DA40D07 www.webcon.ca * Customized Linux network solutions for your business * -------------------------------------------------------------------
On Thu, Sep 29, 2005 at 11:50:58PM -0700, you [subbie subbie] wrote:
> Dear list,
> After almost two weeks of experimentation, google searches and reading of posts, bug reports and discussions I'm still far from an answer. I'm hoping someone on this list could shed some light on the subject.
Hi,
We're having similar problems with 9500S-4LP and two Hitachi 250GB SATA disks.
Currently we are running 2.6.12.5 and its 3w-9xxx driver. We have tried numerous 2.4 (Red Hat and kernel.org) and 2.6 kernels and 3w-9xxx drivers (kernel and 3ware.com).
The results are more or less the same: on 2.4 it corrupts data and is slow; on 2.6 it doesn't corrupt data, but is slow.
Our workload is VMWare GSX server. Multiple readers/writers will grind the performance to a halt.
'noapic' was a recommendation by 3Ware / AMCC tech support. It did not help at all, as expected. Unfortunately they did not have any other recommendations.
I've now removed 'noapic' and unfortunately nothing has changed, really. See current stats below.
I've got about 30MB/s from a single threaded version of my backup code - which seems rather on the low side for a modern RAID-5; with multiple writers I was seeing sub-5MB/s but that might be fair if it is seeking everywhere.
I'd be interested to hear how your experiments with jbod'ing them go.
Dave -- -----Open up your eyes, open up your mind, open up your code ------- / Dr. David Alan Gilbert | Running GNU/Linux on Alpha,68K| Happy \ \ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex / \ _________________________|_____ http://www.treblig.org |_______/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
A single thread writing at 30MB/s is still not on par with 3ware's specs.
I see that you're also running RAID5 and in this case 3ware did report bad write performance on RAID5 and that was fixed with recent firmwares.
The latest linux driver off their website also includes the latest firmware inside it and flashes the card upon load; make sure to use that.
I'm getting a little over 50MB/s when writing to my RAID volume when completely idle; there's no reason why you should get less.
As for read performance, nothing helps with many concurrent reads. What I get is simply awful performance no matter what I do.
I'll let you guys know once I try JBOD (as soon as all the data is moved away).
According to Ville answering me privately:
> Unfortunately, it's not limited to just that firmware version or kernel version or driver version. I've tried several firmwares, 2.4.x and 2.6.x kernels and driver version - no salvage.
I do agree, which leads me to believe this is something very specific with the RAID controller itself or its firmware.
Maybe something is so badly designed in this controller that it physically can't do better than that? Does anyone have experience with this controller and its performance on windoze?
Can someone in the know give us some input?
Thanks
> A single thread writing at 30MB/s is still not on par with 3ware's specs.
> I see that you're also running RAID5 and in this case 3ware did report bad write performance on RAID5 and that was fixed with recent firmwares.
> The latest linux driver off their website also includes the latest firmware inside it and flashes the card upon load, make sure to use that.
I've got driver/firmware that is about 2 months old that certainly helped; prior to that I was getting card timeouts (although I also upgraded the e1000 driver at the same time so it might have been that rather than the 3ware that helped). (Note: I don't expect a driver to perform a dangerous operation like firmware flashing on boot!)
> I'm getting a little over 50MB/s when writing to my RAID volume when completely idle, there's no reason why you should get less.
Well my ~30MB/s is sucking over gig ether and writing in 10MB chunks; but still 50MB/s for RAID5 feels like it sucks.
> I'll let you guys know once I try JBOD (as soon as all the data is moved away).
Nod.
Dave -- -----Open up your eyes, open up your mind, open up your code ------- / Dr. David Alan Gilbert | Running GNU/Linux on Alpha,68K| Happy \ \ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex / \ _________________________|_____ http://www.treblig.org |_______/
> I've got driver/firmware that is about 2months old that certainly helped; prior to that I was getting card timeouts (although I also upgraded the e1000 driver at the same time so it might have been that rather than the 3ware that helped).
Two months old might be too old; make sure you have at least 9.2.1.1.
> (Note: I don't expect a driver to perform a dangerous operation like firmware flashing on boot!)
I believe they are writing to NVRAM or similar so that makes it much less risky, otherwise they wouldn't have to write it on each load ... or I might be wrong and they do it once, not sure.
> > I'm getting a little over 50MB/s when writing to my RAID volume when completely idle, there's no reason why you should get less.
> Well my ~30MB/s is sucking over gig ether and writing in 10MB chunks; but still 50MB/s for RAID5 feels like it sucks.
Yes, that's terrible.
Reading the release notes for the latest one, 9.3.0, which also supports their very latest controller, the 9550SX, I get the feeling that write performance is something of an ongoing issue with this family of controllers, even though they disguise it in this latest driver version as a tweak in write performance.
Another thing: the BBU (battery backup unit) that you are using was also a can of worms for them, which makes it even more important for you to upgrade to the very latest version; they seem to be having ongoing issues with that one.
I have seen some systems on which IRQ load balancing can have a detrimental effect on some devices such as gigabit Ethernet etc.
You could try disabling both the irqbalance userspace daemon (if that's part of your distribution), and in-kernel IRQ balancing, if enabled (CONFIG_IRQBALANCE).
For your NIC, try enabling NAPI interrupt mitigation, if available. This will significantly reduce the interrupt load under high traffic volume.
I guess there's another obvious question that I forgot: Do you have the 3ware cache enabled or disabled? Are your ext3 filesystems mounted with the 'noatime' option?
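The mount-option half of that question can be checked mechanically from `/proc/mounts`. A minimal sketch follows, assuming the usual whitespace-separated layout (device, mount point, type, options, ...); the sample lines are hypothetical.

```python
def ext3_without_noatime(mounts_text):
    """Return mount points of ext3 filesystems whose options lack 'noatime'."""
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type == "ext3" and "noatime" not in options.split(","):
            missing.append(mount_point)
    return missing


# Hypothetical sample, for illustration only.
SAMPLE_MOUNTS = """\
/dev/sda1 /data ext3 rw 0 0
/dev/sdb1 /srv ext3 rw,noatime 0 0
proc /proc proc rw 0 0
"""

if __name__ == "__main__":
    print(ext3_without_noatime(SAMPLE_MOUNTS))
```

Without 'noatime', every read also schedules an inode update, which adds writes to a workload that is otherwise read-only.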
Regards, Ian Morgan
-- ------------------------------------------------------------------- Ian E. Morgan Vice President & C.O.O. Webcon, Inc. imorgan at webcon dot ca PGP: #2DA40D07 www.webcon.ca * Customized Linux network solutions for your business * -------------------------------------------------------------------
On Tue, 4 Oct 2005, subbie subbie wrote:
> Dear Ian,
> 'noapic' was a recommendation by 3Ware / AMCC tech support. It did not help at all, as expected. Unfortunately they did not have any other recommendations.
> I've now removed 'noapic' and unfortunately nothing has changed, really. See current stats below.
> I have also tried playing with the parameters our friend Ville has mentioned in his post, nothing has come out of it.
> I'm willing to give any developer here access to my production machine so that they can see the situation first hand. Performance is just awful.
> I'm planning to ditch RAID5 on this card and try JBOD, spreading files evenly across my 12 disks; hopefully this would give some benefit.
> Something is very wrong with this card / driver / firmware and or kernel combination, hopefully someone can help out.
>> Why are you booting with 'noapic'.. in my experience >> that will seriously >> impact interrupt performance. Use the APIC if you've >> got it, which in this >> case you definitely do.
>> Yes, having your gigabit NIC and RAID controller on >> the same IRQ (in PIC >> mode) could definitely me a source of trouble.
>> In your web server testing, were you using an >> external traffic generator or >> an on-host process? If you try on-host (eliminating >> the network throughput >> and related interrupts) does performance improve?
>> So two biggest suggestions:
>> - Use the APIC. It is your friend.
>> - It looks like the 3ware card and gigabit nic are >> on different busses, but >> the pirq lines are being routed to the same legacy >> interrupt in PIC mode. So >> APIC mode should avoid that problem. If the >> controller and nic are actually >> on the same bus, separate them.
>> Regards, >> Ian Morgan
>> --
> ------------------------------------------------------------------- >> Ian E. Morgan Vice President & C.O.O. >> Webcon, Inc. >> imorgan at webcon dot ca PGP: #2DA40D07 >> www.webcon.ca >> * Customized Linux network solutions for your >> business *
>>> I don't have much experience with systems that are supposed to stream many files concurrently off a hardware RAID of this configuration, but my gut feeling is that something is very wrong and I should be seeing a much higher read throughput.
>>> Trying to preempt people's questions I've tried to include as much information as possible, a lot of stuff is pasted below.
>>> I've just seen that the 3ware driver shares the same IRQ with my ethernet card, which has got me a little worried, should I be?
In article <20050930065058.84446.qm...@web30315.mail.mud.yahoo.com>, subbie subbie <subbie_sub...@yahoo.com> wrote:
>I'm using a 3Ware 9500S-12 card and am able to produce up to 400MB/s sustained read from my 12-disk 4.1TB RAID5 SATA array, 128MB cache onboard, ext3 formatted. All is well when performing a single read -- it works nice and fast.
>The system is a web server, serving mid-size files (50MB, each, on average). All hell breaks loose when doing many concurrent reads, anywhere between 200 to 400 concurrent streams things simply grind to a halt and the system transfers a maximum of 12-14MB/s.
There are a couple of things you should do:
1. Use the CFQ I/O scheduler, and increase nr_requests (replace hda with the block device of your array):
     echo cfq > /sys/block/hda/queue/scheduler
     echo 1024 > /sys/block/hda/queue/nr_requests
2. Make sure that your filesystem knows about the stripe size and number of disks in the array. E.g. for a raid5 array with a stripe size of 64K and 6 disks (effectively 5, because in every stripe-set there is one disk doing parity):
     # ext3 fs, 5 disks, 64K stripe, units in 4K blocks
     mkfs -t ext3 -E stride=$((64/4))
     # xfs, 5 disks, 64K stripe, units in 512 bytes
     mkfs -t xfs -d sunit=$((64*2)) -d swidth=$((5*64*2))
3. Don't use partitions. Partitions do not start on a multiple of (stripe_size * nr_disks), so your I/O will be misaligned and the settings in (2) will have no effect, or an adverse one. If you must use partitions, either build them manually with sfdisk so that partitions do start on that multiple, or use LVM.
4. Reconsider your stripe size for streaming large files. If you have, say, 4 disks and a 64K stripe size, then a read of a block of 256K will busy all 4 disks. Many simultaneous threads reading blocks of 256K will result in thrashing disks, as they all want to read from all 4 disks. In that case, using a stripe size of 256K will make things better: one read of 256K (in the ideal, aligned case) will keep just one disk busy, and 4 reads can happen in parallel without thrashing. Especially in this case, you need the alignment I talked about in (3).
5. Defragment the files. If the files are written sequentially, they will not be fragmented. But if they were stored by writing to thousands of them, appending a few K at a time in round-robin fashion, you need to defragment. In the case of XFS, run xfs_fsr every so often.
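The arithmetic behind points 2 and 4 can be sketched in plain sh. This is only an illustration: the function names align_params and disks_per_read are made up here, the stripe size is the per-disk size in KiB, and disks_per_read assumes the ideal case of a read that starts on a stripe boundary.

```shell
#!/bin/sh
# Illustrative helpers only; the names are not part of any existing tool.

# align_params STRIPE_KB DATA_DISKS
# STRIPE_KB is the per-disk stripe size in KiB, DATA_DISKS excludes parity.
align_params() {
    stripe_kb=$1
    data_disks=$2
    stride=$((stripe_kb / 4))          # ext3 stride, in 4 KiB blocks
    sunit=$((stripe_kb * 2))           # XFS sunit, in 512-byte sectors
    swidth=$((sunit * data_disks))     # XFS swidth, in 512-byte sectors
    echo "stride=$stride sunit=$sunit swidth=$swidth"
}

# disks_per_read READ_KB STRIPE_KB NDISKS
# Number of disks one aligned read keeps busy, capped at the array size.
disks_per_read() {
    read_kb=$1
    stripe_kb=$2
    ndisks=$3
    n=$(( (read_kb + stripe_kb - 1) / stripe_kb ))   # round up
    if [ "$n" -gt "$ndisks" ]; then
        n=$ndisks
    fi
    echo "$n"
}

align_params 64 5          # prints: stride=16 sunit=128 swidth=640
disks_per_read 256 64 4    # prints: 4
disks_per_read 256 256 4   # prints: 1
```

The first call reproduces the numbers used in the mkfs examples in (2); the last two reproduce the 64K versus 256K comparison in (4).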
> I have seen some systems on which IRQ load balancing can have a detrimental effect on some devices such as gigabit Ethernet etc.
> You could try disabling both the irqbalance userspace daemon (if that's part of your distribution), and in-kernel IRQ balancing, if enabled (CONFIG_IRQBALANCE).
I don't have a userspace daemon for that, but I'll try the kernel option.
> For your NIC, try enabling NAPI interrupt mitigation, if available. This will significantly reduce the interrupt load under high traffic volume.
It's always enabled in my configs.
> I guess there's another obvious question that I forgot: Do you have the 3ware cache enabled or disabled? Are your ext3 filesystems mounted with the 'noatime' option?
Write caching is enabled. I don't have much activity across thousands of files so noatime is less critical, but the RAID volume is still mounted noatime.
So basically I'll try the irq load balancing and see what happens.
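For the shared-IRQ question raised earlier in the thread, a quick way to check after each change (APIC on or off, balancing on or off) is to look for lines in /proc/interrupts that list more than one device. A minimal sketch, assuming the usual layout where devices sharing a line are separated by ", "; the function name shared_irqs and the sample table are made up for illustration.

```shell
#!/bin/sh
# shared_irqs [FILE]
# Print the lines of an interrupts table (default /proc/interrupts) whose
# device column names more than one device, i.e. a shared IRQ line.
shared_irqs() {
    awk '/^ *[0-9]+:/ && /, /' "${1:-/proc/interrupts}"
}

# Example on an illustrative table (made-up counts, PIC mode):
sample=$(mktemp)
cat > "$sample" <<'EOF'
           CPU0
  0:     123456    XT-PIC  timer
 11:     987654    XT-PIC  3w-9xxx, eth0
 14:       4321    XT-PIC  ide0
EOF
shared_irqs "$sample"    # prints only the IRQ 11 line
rm -f "$sample"
```

Run without an argument, it reads the live /proc/interrupts; no output means no line is shared.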
At the time that I wrote the tool, 18 months ago, both ext3 and reiserfsV3 performed fairly badly at handling concurrent writes and only JFS and XFS excelled. Since then I believe the ext3 performance has been greatly improved due to the block reservation scheme added in 2.6.10. AFAIK the reiserfs performance is only addressed in reiserfsV4.
The test code is fairly trivial and could be easily adapted to simulate other workloads (like a web server) to help to optimise your filesystem and driver performance.
I have now dumped RAID5 and am running all of my 12 disks separately, each partitioned with XFS.
I did a very crude test of reading a single 1GB file from each of my disks in parallel by putting 12 dd processes into the background. Each file was read at approximately 35MB/s, giving an aggregate of a little over 400MB/s. According to 3Ware support, 400MB/s is the "theoretical maximum" of this controller. I'm very happy with these results.
I want to run a killer test where 400 files are being read in parallel to see what the combined throughput would be. Can anyone recommend a benchmark utility that would help me do so? I tried using bonnie/iozone but they (to my limited understanding) won't do this.
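Short of a dedicated benchmark, the 12-dd test can be stretched to one reader per file. The following is a rough sketch in plain sh, not a real benchmark tool: parallel_read is a made-up name, timing has one-second granularity (date +%s), every stream runs flat out rather than at client speed, and the result only reflects the disks if the files are not already in the page cache.

```shell
#!/bin/sh
# parallel_read FILE...
# Start one background dd per file, wait for all of them, then print the
# total bytes read and the combined throughput. The name is hypothetical.
parallel_read() {
    start=$(date +%s)
    for f in "$@"; do
        # one reader per file; the data itself is discarded
        dd if="$f" of=/dev/null bs=1048576 2>/dev/null &
    done
    wait
    end=$(date +%s)

    # Sum the file sizes after the timed section so it is not skewed.
    total=0
    for f in "$@"; do
        size=$(wc -c < "$f")
        total=$((total + size))
    done

    elapsed=$((end - start))
    if [ "$elapsed" -lt 1 ]; then
        elapsed=1
    fi
    echo "files=$# bytes=$total seconds=$elapsed MB_per_s=$((total / 1048576 / elapsed))"
}

if [ "$#" -gt 0 ]; then
    parallel_read "$@"
fi
```

Saved as, say, parallel_read.sh (a name chosen here), it would be run with the 400 files as arguments, ideally spread evenly over the 12 disks.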
> I now dumped RAID5 and am running all of my 12 disks separately each partitioned with XFS.
> I did a very crude test of reading a single 1GB file from each of my disks in parallel by putting 12 dd processes into the background. Each file was read at approximately 35MB/s giving an aggregate of a little over 400MB/s. According to 3Ware support, 400MB/s is the "theoretical maximum" of this controller. I'm very happy with these results.
Nice. Have you tried Software RAID5 on top of that? I would be very interested to know how software RAID5 goes relative to the 3Ware hardware.
Dave
--
 -----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert    | Running GNU/Linux on Alpha,68K| Happy  \
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/
On Mon, 10 Oct 2005, Dr. David Alan Gilbert wrote:
> Nice. Have you tried Software RAID5 on top of that? I would be very interested to know how software RAID5 goes relative to the 3Ware hardware.
There have been hundreds of emails regarding this on the linux-r...@vger.kernel.org list. Please look in the archives.
It's well known that 3ware hw raid is slow when writing. The current theory is that this is due to the lack of buffering, meaning that any write makes it read a lot as well, destroying performance. Generally, the write performance numbers advertised by 3ware come from a dd to the drive itself, without a filesystem (I got this answer from a support request a few years back). That goes very quickly, but writing files on a filesystem is usually very slow (10 megabyte/s or so). When doing SW raid, the SW layer has access to the memory block cache and can thus avoid a lot of physical reads on the drives.
I never had any problems getting good read speeds on the HW raid.
My experience is with the 7500 series. The 9500 series has cache as well, but this doesn't seem to have solved a lot of the performance problems seen with the 7500 series.