dd(1) performance when copying a disk to another

Patrick Proniewski

Oct 2, 2005, 10:57:09 AM
to freebsd-p...@freebsd.org
Hi,

I run FreeBSD 5.4 on a PIV 3GHz (SuperMicro motherboard, Intel 6300ESB
SATA chipset) with 2 SATA HDDs. I'm in the process of duplicating the
boot HDD to the second HDD. I run dd for that:

# dd if=/dev/ad4 of=/dev/ad6 bs=1m

It yields poor performance:

$ iostat -dhKw 1
(...)
      ad4               ad6
  KB/t tps  MB/s   KB/t tps  MB/s
124.49 252 30.69 128.00 246 30.69
128.00 285 35.64 128.00 279 34.90
128.00 282 35.27 128.00 283 35.40
(...)
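
As a quick sanity check on the columns above, the MB/s figure should be roughly KB/t multiplied by tps. The following is only a sketch; the helper name `mbps` is made up for illustration and is not part of iostat.

```shell
#!/bin/sh
# Hypothetical helper: throughput in MB/s from iostat's KB/t and tps columns.
# MB/s = (KB per transfer * transfers per second) / 1024
mbps() {
    awk -v kbt="$1" -v tps="$2" 'BEGIN { printf "%.2f\n", kbt * tps / 1024 }'
}

mbps 128.00 282   # ad4, third sample; iostat itself reported 35.27
```

The small difference from the reported value is expected, since iostat averages over its sampling interval.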

Is it normal that the data rate won't go above 35-38 MB/s?

HDDs are: ad4 -> Maxtor 80 GB 7200 rpm
          ad6 -> Hitachi 80 GB 7200 rpm

One more question: is dd(1) a good way to duplicate a boot drive to
make a bootable spare disk?

patpro
_______________________________________________
freebsd-p...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-performance
To unsubscribe, send any mail to "freebsd-perform...@freebsd.org"

Steven Hartland

Oct 2, 2005, 11:15:58 AM
to freebsd-p...@freebsd.org, Patrick Proniewski
That's actually pretty good for a sustained read / write on a single disk.

Steve



Arne Wörner

Oct 2, 2005, 11:59:26 AM
to Steven Hartland, freebsd-p...@freebsd.org, Patrick Proniewski
--- Steven Hartland <kil...@multiplay.co.uk> wrote:
> From: "Patrick Proniewski" <pat...@patpro.net>

>> # dd if=/dev/ad4 of=/dev/ad6 bs=1m
>>
>> It yields to poor performances:
>>
> That's actually pretty good for a sustained read / write on a
> single disk.
>
Does somebody know why this is "pretty good"? I mean: where is
the bottleneck?

As far as I know, SATA is quite fast... And memory to memory
copies are quite fast... disc<->memory should be quite fast, too.

>> Is it normal that data rate won't go upper than 35/38 MB/s ?
>>

Hmm...

Can you find out whether DMA transfers are enabled for those discs?
What does dmesg say?
What does "sysctl hw.ata.ata_dma" say?
Maybe atacontrol(8) says something useful about SATA discs, too
(e.g. atacontrol mode 0)?

Can you try the following commands when the system (especially the
discs) is idle?
# dd if=/dev/ad4 of=/dev/null bs=1m count=1000
# dd if=/dev/zero of=/dev/null bs=1m count=1000

(Maybe you could find a way to copy /dev/zero to /dev/ad6 without
destroying the previous work... :-))
E. g.:
# dd if=/dev/ad6 of=/tmp/arne bs=1m count=1000
# dd if=/dev/zero of=/dev/ad6 bs=1m count=1000
# dd if=/tmp/arne of=/dev/ad6 bs=1m count=1000
)
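
The save / overwrite / restore sequence above could be wrapped up roughly as follows. This is a sketch under assumptions: `write_test` and its arguments are illustrative names, the block size is spelled numerically so that both BSD and GNU dd accept it, and `conv=notrunc` is only there so the sketch can also be tried on a regular file instead of a disk.

```shell
#!/bin/sh
# Sketch: time a raw write to the start of TARGET without losing its contents.
# write_test, target, backup and count are illustrative names.
write_test() {
    target=$1; backup=$2; count=${3:-1000}
    bs=1048576   # 1 MiB

    # 1. Save the region that is about to be overwritten.
    dd if="$target" of="$backup" bs="$bs" count="$count" 2>/dev/null || return 1
    # 2. Overwrite it with zeros; dd prints its timing on stderr.
    dd if=/dev/zero of="$target" bs="$bs" count="$count" conv=notrunc
    # 3. Put the saved data back.
    dd if="$backup" of="$target" bs="$bs" count="$count" conv=notrunc 2>/dev/null
}

# Example (the target is damaged if this is interrupted between steps 2 and 3):
# write_test /dev/ad6 /tmp/arne 1000
```

The backup file needs room for count MiB, and the timing of interest is the one printed by step 2.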

> one more question: is dd(1) a good way to duplicate a boot
> drive to make a bootable spare disk ?
>

Is the file system on /dev/ad4 read-only during the "dd"?
If /dev/ad4 changes before "dd" completes, ad6 might need an fsck,
or ad6 might be useless...

Btw.: I use gmirror(8)... But then an unintentional, fatal change
to ad4 would be fatal for ad6, too... :-)) So I have to hope
that I do not type things I should not type (luckily I have some
boot CDs for that unlikely case ;-)) )...

-Arne



Patrick Proniewski

Oct 2, 2005, 12:34:29 PM
to Arne Wörner, freebsd-p...@freebsd.org, Steven Hartland
Hi,

> Can u find out, if DMA transfers are enabled for those discs?
> What does dmesg say?

see end of mail for full dmesg output,


> What does "sysctl hw.ata.ata_dma" say?

hw.ata.ata_dma: 1


> Maybe atacontrol(8) says something useful about SATA discs, too
> (e. g. atacontrol mode 0)?

# atacontrol mode 0
Master = BIOSPIO
Slave = BIOSPIO


> Can u try the following commands, when the system (especially the
> discs) is idle?
> #dd if=/dev/ad4 of=/dev/null bs=1m count=1000

# dd if=/dev/ad4 of=/dev/null bs=1m count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 17.647464 secs (59417943 bytes/sec)

> #dd if=/dev/zero of=/dev/null bs=1m count=1000

# dd if=/dev/zero of=/dev/null bs=1m count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 0.199381 secs (5259154109 bytes/sec)

> (Maybe you could find a way to copy /dev/zero to /dev/ad6 without
> destroying the previous work... :-))

well, not very easy, both disks are the same size ;)


>> one more question: is dd(1) a good way to duplicate a boot
>> drive to make a bootable spare disk ?

> I say, is the file system on /dev/ad4 read only during the "dd"?
> If /dev/ad4 changes before "dd" completes, ad6 might need a fsck
> or ad6 might be useless...

well, ad4 is not read-only, but I've shut down every unnecessary
service, and in the end the ad6 HDD is bootable! It boots OK and
everything seems to work as well as on the ad4 disk. That's fine for
me, it's just a spare emergency disk.

thanks,

Pat

dmesg :

Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights
reserved.
FreeBSD 5.4-RELEASE-p6 #0: Mon Aug 29 15:58:58 CEST 2005
ro...@toto.patpro.net:/usr/obj/usr/src/sys/PATPRO-20050829
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) 4 CPU 3.00GHz (2994.90-MHz 686-class CPU)
Origin = "GenuineIntel" Id = 0xf41 Stepping = 1

Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Hyperthreading: 2 logical CPUs
real memory = 1072562176 (1022 MB)
avail memory = 1044230144 (995 MB)
ACPI APIC Table: <IntelR AWRDACPI>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 1
ioapic0: Changing APIC ID to 2
ioapic0 <Version 2.0> irqs 0-23 on motherboard
ioapic1 <Version 2.0> irqs 24-47 on motherboard
npx0: <math processor> on motherboard
npx0: INT 16 interface
acpi0: <IntelR AWRDACPI> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
acpi_button0: <Power Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> at device 3.0 on pci0
pci1: <ACPI PCI bus> on pcib1
em0: <Intel(R) PRO/1000 Network Connection, Version - 1.7.35> port
0xc000-0xc01f mem 0xf2000000-0xf201ffff irq 18 at device 1.0 on pci1
em0: Ethernet address: 00:30:48:83:ef:8c
em0: Speed:N/A Duplex:N/A
pcib2: <ACPI PCI-PCI bridge> at device 28.0 on pci0
pci2: <ACPI PCI bus> on pcib2
uhci0: <UHCI (generic) USB controller> port 0xe100-0xe11f irq 16 at
device 29.0 on pci0
usb0: <UHCI (generic) USB controller> on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: <UHCI (generic) USB controller> port 0xe000-0xe01f irq 19 at
device 29.1 on pci0
usb1: <UHCI (generic) USB controller> on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
pci0: <base peripheral> at device 29.4 (no driver attached)
pci0: <base peripheral, interrupt controller> at device 29.5 (no
driver attached)
ehci0: <EHCI (generic) USB 2.0 controller> mem 0xf2100000-0xf21003ff
irq 23 at device 29.7 on pci0
usb2: EHCI version 1.0
usb2: companion controllers, 2 ports each: usb0 usb1
usb2: <EHCI (generic) USB 2.0 controller> on ehci0
usb2: USB revision 2.0
uhub2: Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub2: 4 ports with 4 removable, self powered
pcib3: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci3: <ACPI PCI bus> on pcib3
pci3: <display, VGA> at device 9.0 (no driver attached)
em1: <Intel(R) PRO/1000 Network Connection, Version - 1.7.35> port
0xd100-0xd13f mem 0xf1000000-0xf101ffff irq 19 at device 10.0 on pci3
em1: Ethernet address: 00:30:48:83:ef:8d
em1: Speed:N/A Duplex:N/A
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel 6300ESB UDMA100 controller> port 0xf000-0xf00f,
0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 31.1 on pci0
ata0: channel #0 on atapci0
ata1: channel #1 on atapci0
atapci1: <Intel 6300ESB SATA150 controller> port 0xe600-0xe60f,
0xe500-0xe503,0xe400-0xe407,0xe300-0xe303,0xe200-0xe207 irq 18 at
device 31.2 on pci0
ata2: channel #0 on atapci1
ata3: channel #1 on atapci1
pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
acpi_tz0: <Thermal Zone> on acpi0
fdc0: <floppy drive controller> port 0x3f7,0x3f0-0x3f5 irq 6 drq 2 on
acpi0
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10
on acpi0
sio0: type 16550A
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
ppc0: <Standard parallel printer port> port 0x778-0x77b,0x378-0x37f
irq 7 on acpi0
ppc0: Generic chipset (NIBBLE-only) in COMPATIBLE mode
ppbus0: <Parallel port bus> on ppc0
ppi0: <Parallel I/O> on ppbus0
atkbdc0: <Keyboard controller (i8042)> port 0x64,0x60 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
pmtimer0 on isa0
orm0: <ISA Option ROM> at iomem 0xc0000-0xc7fff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on
isa0
Timecounters tick every 10.000 msec
em0: Link is up 100 Mbps Full Duplex
ad4: 78167MB <Maxtor 6Y080M0/YAR51HW0> [158816/16/63] at ata2-master SATA150
ad6: 194481MB <Maxtor 6L200M0/BANC1E00> [395136/16/63] at ata3-master SATA150
   <-- this is _not_ the ad6 I've used dd on; this is my regular ad6
       storage disk.
SMP: AP CPU #1 Launched!
Mounting root from ufs:/dev/ad4s1a
em0: Link is up 100 Mbps Full Duplex
Accounting enabled
pflog0: promiscuous mode enabled
em0: Link is up 100 Mbps Full Duplex
em0: Link is up 100 Mbps Full Duplex

Arne Wörner

Oct 2, 2005, 1:04:46 PM
to Patrick Proniewski, freebsd-p...@freebsd.org
Hi!

--- Patrick Proniewski <pat...@patpro.net> wrote:
> > Can u find out, if DMA transfers are enabled for those discs?
> > What does dmesg say?
>
> see end of mail for full dmesg output,
>

Looks good... :-)) But I never saw FBSD's kernel messages about
SATA drives... ;-)

> > Maybe atacontrol(8) says something useful about SATA discs,
> > too (e. g. atacontrol mode 0)?
>
> # atacontrol mode 0
> Master = BIOSPIO
> Slave = BIOSPIO
>

Hmm... 0 seems to be the wrong ATA channel... That's why the output
does not fit SATA drives, I think...

> # dd if=/dev/ad4 of=/dev/null bs=1m count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes transferred in 17.647464 secs (59417943
> bytes/sec)
>

That seems to be about 2 times faster than the disc->disc
transfer... But still slower than I would have expected...
SATA150 sounds like the drive can do 150MB/sec...

As far as I know, SATA buses are independent of each other (no
master/slave; every drive gets its own cable)... Maybe "dd" cannot
issue a read request while the write isn't completed? DMA
shouldn't be the problem, since the memory interface is quite fast
in your case...

So the questions remain:
1. Why does the read speed drop in your setup (maybe writing to
ad6 takes more time than reading from ad4? You could try to run two
dd processes, one with if=ad4 and the other with if=ad6)?
2. Why can't we reach 150MB/sec?

> > (Maybe you could find a way to copy /dev/zero to /dev/ad6
> > without destroying the previous work... :-))
>
> well, not very easy both disk are the same size ;)
>

I thought of the first 1000 1MB blocks... :-)
The write speed might be interesting...

Eric Anderson

Oct 2, 2005, 1:44:02 PM
to Arne Wörner, freebsd-p...@freebsd.org, Patrick Proniewski

The reason why 35-40MB/s is good is because the drive itself cannot
stream any faster. SATA-150 interface is rated at 150MB/s, but the disk
cannot get close. Look at the specs for the drive, and you'll see that
the sustained rate is much lower than the burst speed. If you want fast
performance on a SATA disk, you'll need to buy a WD Raptor drive (74GB)
- that will get you more speed, but still not the 150MB/s.

>>>(Maybe you could find a way to copy /dev/zero to /dev/ad6
>>>without destroying the previous work... :-))
>>
>>well, not very easy both disk are the same size ;)
>>
>
> I thought of the first 1000 1MB blocks... :-)
> The write speed might be interesting...

Instead of dd, why not use gmirror?

Also - reads can be faster since the drive can read-ahead a number of
blocks into the cache in an efficient manner, but writes have to be
streamed to disk as they come in (going through the cache, and
buffering, but you get the idea).

Have you tried a smaller block size? What does 8k, 16k, or 512k do for
you? There really isn't much room for improvement here on a single device.


Eric


--
------------------------------------------------------------------------
Eric Anderson Sr. Systems Administrator Centaur Technology
Anything that works is better than anything that doesn't.
------------------------------------------------------------------------

Steven Hartland

Oct 2, 2005, 2:25:20 PM
to Arne Wörner, Patrick Proniewski, freebsd-p...@freebsd.org
----- Original Message -----
From: "Arne Wörner" <arne_w...@yahoo.com>
> That seems to be 2 or about 2 times faster than disc->disc
> transfer... But still slower, than I would have expected...
> SATA150 sounds like the drive can do 150MB/sec...

LOL, you might want to read up on what SATA150 means.
In short, it is the max throughput the interface can sustain. It is NOT
what you can get out of a single disk, which is still far from that;
SATA disk transfer rates are typically 30 -> 50MB/s sustained.

Steve

Patrick Proniewski

Oct 3, 2005, 3:55:49 AM
to Eric Anderson, freebsd-p...@freebsd.org, Arne Wörner
Hi Arne and Eric,

>>> # atacontrol mode 0
>>> Master = BIOSPIO
>>> Slave = BIOSPIO

>> Hmm... 0 seems to be the wrong ata... Thats why the output does
>> not fit to SATA drives, I think...

oops... I'll have to do it again with channels 2 and 3


>>> # dd if=/dev/ad4 of=/dev/null bs=1m count=1000
>>> 1000+0 records in
>>> 1000+0 records out
>>> 1048576000 bytes transferred in 17.647464 secs (59417943
>>> bytes/sec)

>> That seems to be 2 or about 2 times faster than disc->disc
>> transfer... But still slower, than I would have expected...
>> SATA150 sounds like the drive can do 150MB/sec...

As Eric pointed out, you just can't reach 150 MB/s with one disk,
it's a technological maximum for the bus, but real world performance
is well below this max.
In fact, I thought I would reach about 50 to 60 MB/s.

>>>> (Maybe you could find a way to copy /dev/zero to /dev/ad6
>>>> without destroying the previous work... :-))
>>>
>>> well, not very easy both disk are the same size ;)

>> I thought of the first 1000 1MB blocks... :-)

damn, I misread this one... :)
I'm gonna try this asap.


> Instead of dd, why not use gmirror?

I had no idea gmirror exists, but I'll continue with dd. It's a one
time experiment.


> Have you tried a smaller block size? What does 8k, 16k, or 512k do
> for you? There really isn't much room for improvement here on a
> single device.

nope, I'll try one of them, but I can't do many experiments, the box
is in my living room, it's a 1U rack, and it's VERY VERY noisy. My
girlfriend will kill me if it's running more than an hour a day :))


Pat

Bruce Evans

Oct 3, 2005, 10:21:15 AM
to Patrick Proniewski, freebsd-p...@freebsd.org, Eric Anderson, Arne Wörner
On Mon, 3 Oct 2005, Patrick Proniewski wrote:

>>>> # dd if=/dev/ad4 of=/dev/null bs=1m count=1000
>>>> 1000+0 records in
>>>> 1000+0 records out
>>>> 1048576000 bytes transferred in 17.647464 secs (59417943
>>>> bytes/sec)

Many wrong answers to the original question have been given. dd with
a block size of 1m between (separate) disk devices is much slower
just because that block size is far too large...

The above is a fairly normal speed. The expected speed depends mainly
on the disk technology generation and the placement of the sectors being
read. I get the following speeds for _sequential_ _reading_ from the
outer (fastest) tracks of 6- and 3-year-old drives which are about 2
generations apart:

%%%
Sep 25 21:52:35 besplex kernel: ad0: 29314MB <IBM-DTLA-307030> [59560/16/63] at ata0-master UDMA100
Sep 25 21:52:35 besplex kernel: ad2: 58644MB <IC35L060AVV207-0> [119150/16/63] at ata1-master UDMA100
ad0 bs 512: 16777216 bytes transferred in 2.788209 secs (6017201 bytes/sec)
ad0 bs 1024: 16777216 bytes transferred in 1.433675 secs (11702245 bytes/sec)
ad0 bs 2048: 16777216 bytes transferred in 0.787466 secs (21305320 bytes/sec)
ad0 bs 4096: 16777216 bytes transferred in 0.479757 secs (34970249 bytes/sec)
ad0 bs 8192: 16777216 bytes transferred in 0.477803 secs (35113250 bytes/sec)
ad0 bs 16384: 16777216 bytes transferred in 0.462006 secs (36313842 bytes/sec)
ad0 bs 32768: 16777216 bytes transferred in 0.462038 secs (36311331 bytes/sec)
ad0 bs 65536: 16777216 bytes transferred in 0.486850 secs (34460748 bytes/sec)
ad0 bs 131072: 16777216 bytes transferred in 0.462046 secs (36310693 bytes/sec)
ad0 bs 262144: 16777216 bytes transferred in 0.469866 secs (35706382 bytes/sec)
ad0 bs 524288: 16777216 bytes transferred in 0.462035 secs (36311555 bytes/sec)
ad0 bs 1048576: 16777216 bytes transferred in 0.478534 secs (35059612 bytes/sec)
ad2 bs 512: 16777216 bytes transferred in 4.115675 secs (4076419 bytes/sec)
ad2 bs 1024: 16777216 bytes transferred in 2.105451 secs (7968466 bytes/sec)
ad2 bs 2048: 16777216 bytes transferred in 1.132157 secs (14818809 bytes/sec)
ad2 bs 4096: 16777216 bytes transferred in 0.662452 secs (25325935 bytes/sec)
ad2 bs 8192: 16777216 bytes transferred in 0.454654 secs (36901065 bytes/sec)
ad2 bs 16384: 16777216 bytes transferred in 0.304761 secs (55050416 bytes/sec)
ad2 bs 32768: 16777216 bytes transferred in 0.304761 secs (55050416 bytes/sec)
ad2 bs 65536: 16777216 bytes transferred in 0.304765 secs (55049683 bytes/sec)
ad2 bs 131072: 16777216 bytes transferred in 0.304762 secs (55050200 bytes/sec)
ad2 bs 262144: 16777216 bytes transferred in 0.304760 secs (55050588 bytes/sec)
ad2 bs 524288: 16777216 bytes transferred in 0.304762 secs (55050200 bytes/sec)
ad2 bs 1048576: 16777216 bytes transferred in 0.304757 secs (55051148 bytes/sec)
%%%
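
A sweep like the one in the table could be produced along these lines. The script that generated the table was not posted, so this is only a sketch: `bs_sweep` and its arguments are illustrative, and the 16 MiB total and the 512-byte to 1 MiB range are taken from the table itself.

```shell
#!/bin/sh
# Sketch of a block-size sweep: read the same number of bytes from SRC with
# block sizes from 512 bytes to 1 MiB, doubling each time.
# bs_sweep and its arguments are illustrative names.
bs_sweep() {
    src=$1
    total=${2:-16777216}          # bytes to read per run (16 MiB, as in the table)
    bs=512
    while [ "$bs" -le 1048576 ]; do
        count=$((total / bs))
        printf '%s bs %s: ' "$src" "$bs"
        # dd reports its statistics on stderr; keep only the summary line.
        dd if="$src" of=/dev/null bs="$bs" count="$count" 2>&1 | tail -n 1
        bs=$((bs * 2))
    done
}

# Example: bs_sweep /dev/ad2
```

When this is pointed at a regular file rather than a disk device, the later runs will mostly measure the buffer cache, so the numbers are only meaningful on a raw device.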

Drive technology hit a speed plateau a few years ago so newer single drives
aren't much faster unless they are more expensive and/or smaller.

The speed is low for small block sizes because the device has to be
talked to too much and the protocol and firmware are not very good.
(Another drive, a WDC 120GB with more cache (8MB instead of 2), ramps
up to about half speed (26MB/sec) for a block size of 4K but sticks
at that speed for block sizes 8K and 16K, then jumps up to full speed
for block sizes of 32K and larger. This indicates some firmware
stupidity). Most drives ramp up almost linearly (doubling
the block size almost doubles the speed). This behaviour is especially
evident on slow SCSI drives like some (most?) ZIP and dvd/cd. The
command overhead can be 20 msec, so you had better not do only 512 bytes
of i/o per command or you will get a speed of 25K/sec. The command
overhead of a new ATA drive is more like 50 usec, but that is still
far too much for high speed with a block size of 512 bytes.
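
The effect of per-command overhead can be put into a rough model: each command costs a fixed overhead plus the time the media needs for the data, so rate = bs / (overhead + bs / media_rate). The sketch below is an illustration only. `est_rate` is a made-up name, and the media rates passed in the examples are assumptions (about 55 MB/s taken from the ad2 figures above, and a guessed 1 MB/s for a slow ZIP-like device).

```shell
#!/bin/sh
# Sketch: estimated throughput in bytes/sec for a given block size.
# est_rate BS OVERHEAD_SECONDS MEDIA_BYTES_PER_SEC
# Model (an assumption): time per command = overhead + bs / media_rate
est_rate() {
    awk -v bs="$1" -v ov="$2" -v media="$3" \
        'BEGIN { printf "%.0f\n", bs / (ov + bs / media) }'
}

est_rate 512   0.020    1000000    # 20 msec overhead: roughly 25K/sec
est_rate 512   0.000050 55000000   # 50 usec overhead, 512-byte blocks
est_rate 16384 0.000050 55000000   # same overhead, 16K blocks
```

The model ignores caching and read-ahead in the drive, so it should only be read as showing why small blocks are dominated by the overhead term.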

The speed is insignificantly different for block sizes larger than a
limit because the drive's physical limits dominate except possibly
with old (slow) CPUs.

>>> That seems to be 2 or about 2 times faster than disc->disc
>>> transfer... But still slower, than I would have expected...
>>> SATA150 sounds like the drive can do 150MB/sec...
>
> As Eric pointed out, you just can"t reach 150 MB/s with one disk, it's a
> technological maximum for the bus, but real world performance is well bellow
> this max.
> In fact, I've though I would reach about 50 to 60 MB/s.

50-60 MB/s is about right. I haven't benchmarked any SATA or very new
drives. Apparently they are not much faster. ISTR that WDC Raptors are
speced for 70-80MB/sec. You pay twice as much to get a tiny drive with
only 25% more throughput plus faster seeks.

>>>>> (Maybe you could find a way to copy /dev/zero to /dev/ad6
>>>>> without destroying the previous work... :-))
>>>>
>>>> well, not very easy both disk are the same size ;)
>
>>> I thought of the first 1000 1MB blocks... :-)
>
> damn, I misread this one... :)
> I'm gonna try this asap.

I divide disks into equally sized (fairly small, or half the disk size)
partitions, and cp between them. dd is too hard to use for me ;-). cp
is easier to type and automatically picks a reasonable block size. Of
course I use dd if the block size needs to be controlled, but mostly I
only use it in preference to cp to get its timing info.

>...


>> Have you tried a smaller block size? What does 8k, 16k, or 512k do for
>> you? There really isn't much room for improvement here on a single device.
>
> nop, I'll try one of them, but I can't do many experiments, the box is in my
> living room, it's a 1U rack, and it's VERY VERY noisy. My girlfriend will
> kill me if it's running more than an hour a day :))

Smaller block sizes will go much faster, except for copying from a disk to
itself. Large block sizes are normally a pessimization and the pessimization
is especially noticeable for dd. Just use the smallest block size that gives
an almost-maximal throughput (e.g., 16K for reading ad2 above, possibly
different for writing). Large block sizes are pessimal for synchronous
i/o like dd does. The timing for dd'ing blocks of size N MB at R MB/sec
between ad0 and ad2 is something like:

time in secs       activity on ad0      activity on ad2
------------       ---------------      ---------------
0                  start read of 1MB    idle
N/R                finish read; idle    start write of 1MB
2*N/R-epsilon      start read of 1MB    pretend to complete write
2*N/R              continue read        complete write
3*N/R-epsilon      finish read; idle    start write of 1MB
4*N/R-2*epsilon    ...                  ...

After the first block (which takes a little longer), it takes 2*N/R-epsilon
seconds to copy 1 block, where epsilon is the time between the writer's
pretending to complete the write and actually completing it. This time
is obviously not very dependent on the block size since it is limited by
the drive's resources and policies (in particular, if the drive doesn't do
write caching, perhaps because write caching is not enabled, then epsilon
is 0, and if our block size is large compared with the drive's cache then
the drive won't be able to signal completion until no more than the drive's
cache size is left to do). Thus epsilon becomes small relative to the
N/R term when N is large. Apparently, in your case the speed drops from
59MB/sec to 35MB/sec, so with N == 1 and R == 59, epsilon is about 1/200.
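
The epsilon estimate can be reproduced with a little arithmetic. The sketch below assumes that one copy cycle is one read (N/R) plus one write (N/R) less the overlap epsilon, and that the observed copy rate C satisfies N/C = 2*N/R - epsilon. The function name is illustrative.

```shell
#!/bin/sh
# Sketch: solve  N/C = 2*N/R - epsilon  for epsilon.
#   N = block size in MB, R = single-disk rate in MB/sec,
#   C = observed disk-to-disk copy rate in MB/sec.
epsilon() {
    awk -v n="$1" -v r="$2" -v c="$3" \
        'BEGIN { printf "%.4f\n", 2 * n / r - n / c }'
}

epsilon 1 59 35   # prints 0.0053 (seconds), i.e. on the order of 1/200
```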

With large block sizes, the speed can be increased using asynchronous output.
There is a utility (in ports) named team that fakes async output using
separate processes. I have never used it. Something as simple as 2
dd's in a pipe should work OK.
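
The "2 dd's in a pipe" idea might look like the sketch below. The reader and the writer are separate processes, so the next read can be issued while the previous write is still in progress. `pipe_copy` and the 64K default block size are illustrative choices, and `conv=notrunc` is only there so the sketch can be tried on regular files.

```shell
#!/bin/sh
# Sketch: copy SRC to DST through a pipe between two dd processes.
# pipe_copy and the default block size are illustrative.
pipe_copy() {
    src=$1; dst=$2; bs=${3:-65536}
    # Reader and writer run concurrently; the pipe decouples them.
    dd if="$src" bs="$bs" 2>/dev/null | dd of="$dst" bs="$bs" conv=notrunc
}

# Example: pipe_copy /dev/ad4 /dev/ad6
```

How much overlap this gives depends on the pipe's buffer size, so it is worth timing against a plain dd before relying on it.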

For copying from a disk to itself, a large block size is needed to limit the
number of seeks, and concurrent reads and writes are exactly what is not
needed (since they would give competing seeks). The i/o must be
sequentialized, and dd does the right things for this, though the drive
might not (you would prefer epsilon == 0, since if the drive signals
write completion early then it might get confused when you flood it
with the next read and seek to start the read before it completes the
write, then thrash back and forth between writing and reading).

It is interesting that writing large sequential files to at least the
ffs file system (not mounted with -sync) in FreeBSD is slightly faster
than writing directly to the raw disk using write(2), even if the
device driver sees almost the same block sizes for these different
operations. This is because write(2) is synchronous and sync writes
always cause idle periods (the idle periods are just much smaller for
writing data that is already in memory), while the kernel uses async
writes for data.

Bruce

Tulio Guimarães da Silva

Oct 3, 2005, 10:57:14 AM
to freebsd-p...@freebsd.org
Steven Hartland wrote:

> ----- Original Message ----- From: "Arne Wörner" <arne_w...@yahoo.com>
>
>> That seems to be 2 or about 2 times faster than disc->disc
>> transfer... But still slower, than I would have expected...
>> SATA150 sounds like the drive can do 150MB/sec...
>
>
> LOL, you might want to read up on what SATA150 means.
> In short it the max throughput the interface can sustain. It is NOT
> what you can get of a single disk which is still fare from that,
> SATA disk transfer rates typically 30 -> 50MB/s sustained.
>
> Steve

Indeed. In other words, that represents the max transfer rate between
the SATA controller and the disk's controller (at best, you'll get close
to it when reading from the disk's onboard cache), but the media will
always be much slower.
But just to clear up some questions...
1) Maxtor's full specifications for the Diamond Max+ 9 Series refer to
maximum *sustained* transfer rates of 37MB/s and 67MB/s for "ID" and
"OD", respectively (though I couldn't find exactly what that means, I
deduced that it represents the rates for the center and border parts of
the disk - please correct me if I'm wrong), so your tests show you're
getting the best out of it ;) ;
2) Mr. Hartland mentioned the numbers to be good for a single drive,
therefore they're even a bit better for a disk-to-disk copy, where the
limit should be the slower disk's performance. I couldn't look up the
specs of the Hitachi since I didn't have the exact model, but I would
expect it to be equal to or faster than the Maxtor, since it does not
appear to be a bottleneck.

One last thought, though, for the specialists: iostat showed a maximum
of 128KB/transfer, even though dd should be using 1MB blocks... is that
expected behaviour? Shouldn't iostat show 1024KB/t, then?
Thanks for your attention,

Tulio G. da Silva

Tulio Guimarães da Silva

Oct 3, 2005, 11:08:31 AM
to freebsd-p...@freebsd.org
Phew, thanks for that. :) This seems to answer my question in the
other "leg" of the thread, though it hadn't yet reached me when I
wrote the message.
Now THAT's quite a good explanation. ;) Thanks again,

Tulio G. da Silva


Bruce Evans

Oct 3, 2005, 8:48:48 PM
to Tulio Guimarães da Silva, freebsd-p...@freebsd.org
On Mon, 3 Oct 2005, Tulio Guimarães da Silva wrote:

> But just to clear out some questions...
> 1) Maxtor´s full specifications for Diamond Max+ 9 Series refers to maximum
> *sustained* transfer rates of 37MB/s and 67MB/s for "ID" and "OD",
> respectively (though I couldn´d find exactly what it means, I deduced that
> represents the rates for center- and border-parts of the disk - please
> correct me if I´m wrong), then your tests show you´re getting the best out of
> it ;) ;

> much slower.

Another interesting point is that you can often get closer to the maximum
rate than the average of the maximum and minimum rates. The outer tracks
contain more sectors (about 67/37 times as many with the above spec), so
the average rate over all sectors is larger than the average of the max and
min, significantly so since 67/37 is a fairly large ratio. Also, you can
often partition disks to put less-often accessed stuff in the slow parts.
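
To get a feel for the size of this effect, here is a sketch under a deliberately simple model, which is an assumption and not something from the drive spec: tracks are evenly spaced between the inner and outer radius, and both the sectors per track and the transfer rate grow linearly with radius. The sector-weighted mean rate is then (2/3) * (max^3 - min^3) / (max^2 - min^2). The function name is illustrative.

```shell
#!/bin/sh
# Sketch: compare the plain midpoint of min/max rates with the mean rate
# weighted by sectors per track, assuming rate and sectors/track are both
# proportional to track radius (a simplifying assumption).
avg_rate() {
    awk -v lo="$1" -v hi="$2" 'BEGIN {
        weighted = (2 / 3) * (hi^3 - lo^3) / (hi^2 - lo^2)
        printf "midpoint %.1f, sector-weighted %.1f\n", (lo + hi) / 2, weighted
    }'
}

avg_rate 37 67   # prints: midpoint 52.0, sector-weighted 53.4
```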

> One last thought, though, for the specialists: iostat showed maximum of
> 128KB/transfer, even though dd should be using 1MB blocks... is that an
> expected behaviour? Shouldn´t iostat show 1024Kb/t, then?

The expected size is 64K. 128KB is due to a bug in GEOM, one that was
fixed a couple of days ago by tegge@.

iostat shows the size that reaches the disk driver. The best size to
show is the size that reaches the disk hardware, but several layers
of abstraction, some excessive, make it impossible to show that size:

First there is the disk firmware layer above the disk hardware layer.
There is no way for the driver to know exactly what the firmware layer
is doing. Good firmware will cluster i/o's and otherwise cache things
to minimize seeks and other disk accesses, in much the same way that
a good OS will do, but hopefully better because it can understand the
hardware better and use more specialized algorithms.

Next there is the driver layer. Drivers shouldn't split up i/o, but
some at least used to, and they now cannot report such splitting to
devstat. I can't see any splitting in the ad driver now -- I can only
see reduction of the max size from 255 to 128 sectors in the non-DMA
case, and the misnamed struct member atadev->max_iosize in this case
(this actually gives the max transfer size; in the DMA case, the max
transfer size is the same as the max i/o size, but in the non-DMA case
it is the number of sectors transferred per interrupt which is usually
much smaller than the max i/o size of DFLTPHYS = 64K). The fd driver
at least used to split up i/o into single sectors. 20-25 years ago
when CPUs were slow even compared with floppies, this used to be a
good way to pessimize i/o. A few years later, starting with about
386's, CPUs became fast enough to easily generate new requests in the
sector gap time so even poorly written fd drivers could keep floppies
streaming except across seeks to another track. The fd driver never
reported this internal splitting to devstat, and maybe never should
have since it is close enough to the hardware to know that this splitting
is normal and/or doesn't affect efficiency.

Next there is the GEOM layer. It splits up i/o's requested by the
next layer up according to the max size advertised by the driver. The
latter is typically DFLTPHYS = 64K and often unrelated to the hardware;
MAXPHYS = 128K would be better if the hardware can handle it. Until
a couple of days ago, reporting of this splitting was broken. GEOM
reported to devstat the size passed to it and not the size that it
passed to drivers. tegge@ fixed this.

For writes to raw disks, the next layer up is physread(). (Other cases
are even more complicated :-).) physread() splits up i/o's into blocks
of max size dev->si_iosize_max. This splitting is wrong for tape-like
devices but is almost harmless for disk-like devices. Another bug in
GEOM is bitrot in the setting of dev->si_iosize_max. This should
normally be the same as the driver max size, and used to be set to the
same in individual drivers in many cases including the ad driver,
but now most drivers don't set it and GEOM normally defaults it to
the bogus value MAXPHYS = 128K. physread() also defaults it, but to
the different, safer, value DFLTPHYS = 64K. The different max sizes
cause excessive splitting. See below for examples.

For writes by dd, there are a few more layers (driver read, devfs read,
and write(2) at least).

So for writes of 1M from dd to an ad device with DMA enabled and the
normal DMA size of 64K, the following reblocking occurs:

1M is split into 8*128K by physio() since dev->si_iosize_max is 128K
8*128K is split into 16*64K by GEOM since dp->d_maxsize is mismatched (64K)

dp->d_maxsize is 63K for a couple of controllers in the DMA case and possibly
always for the acd driver (see the magic 65534 in atapi-cd.c). Then the
bogus splitting is more harmful:

1M is split into 8*128K by physio() (no difference)
8*128K is split into 8 * (2*63K + 1*2K) by GEOM

The 1*2K splitting is especially pessimal. The afd driver used to have
this bug internally, and still has it in RELENG_4. Its max i/o (DMA)
size was 32K for ZIP disks that seem to be IOMEGA ones and 126K for
other drives. dd'ing to ZIP drives was fast enough if you used a size
smaller than the max i/o size (but not very small), or with nice power
of 2 sizes for disks that seem to be IOMEGA ones, but a nice size of
128K caused the following bad splitting for non-IOMEGA ones:
128K = 1*126K + 1*2K. Since accesses to ZIP disks take about 20 msec
per access, the 2K-block almost halved the transfer speed.
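
The reblocking arithmetic above can be replayed with a few lines of
Python. This is only a sketch of the sizes involved, not of the kernel
code: split() and split_chain() are made-up helper names, all sizes are
in KB, and each layer is assumed to cut greedily at its own max.

```python
def split(size_kb, max_kb):
    """Cut one request into pieces of at most max_kb (greedy, in order)."""
    full, rest = divmod(size_kb, max_kb)
    return [max_kb] * full + ([rest] if rest else [])


def split_chain(size_kb, limits_kb):
    """Apply each layer's limit in turn, top layer first."""
    pieces = [size_kb]
    for limit in limits_kb:
        pieces = [chunk for piece in pieces for chunk in split(piece, limit)]
    return pieces


# 1M from dd, 128K limit in physio(), 64K limit in the driver.
even = split_chain(1024, [128, 64])

# Same write when the driver limit is 63K instead.
uneven = split_chain(1024, [128, 63])

# Old afd case: a 128K request against a 126K limit.
zip_case = split_chain(128, [126])

print(len(even), "pieces, sizes", sorted(set(even)))
print(len(uneven), "pieces, first three", uneven[:3])
print("afd:", zip_case)
```

With a 63K limit the same 1M turns into 24 transfers instead of 16, and
a third of them are 2K runts.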

The normal ata DMA size of 64*1024 is also too magic -- it just happens
to equal DFLTPHYS so it only causes 1 bogus splitting in combination
with the other bugs.

For writes by dd, these bugs are easy to avoid if you know about them or
if you just fear them and test all reasonable block sizes to find the best
one. Just use a block size large enough to be efficient but small enough
to not cause splitting, or in cases where the mismatches are only off by a
factor of 2^n, large enough to cause even splitting.
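
A rough way to shortlist block sizes before timing them is to check
which ones come out of the layers in equal-sized pieces. The sketch
below assumes greedy splitting at each layer and uses an arbitrary
candidate list; pieces() and the two limit sets are illustrative names
and values taken from the examples in this mail, not anything read from
a running system.

```python
def pieces(bs_kb, limits_kb):
    """Sizes (in KB) that reach the driver for one write of bs_kb."""
    out = [bs_kb]
    for limit in limits_kb:
        nxt = []
        for size in out:
            full, rest = divmod(size, limit)
            nxt += [limit] * full + ([rest] if rest else [])
        out = nxt
    return out


def splits_evenly(bs_kb, limits_kb):
    """True if every resulting piece has the same size (no runt)."""
    return len(set(pieces(bs_kb, limits_kb))) == 1


candidates = [16, 32, 63, 64, 126, 128, 256, 1024]  # KB, arbitrary picks

# si_iosize_max = 128K with a 64K driver max, then with a 63K driver max.
good_64 = [bs for bs in candidates if splits_evenly(bs, [128, 64])]
good_63 = [bs for bs in candidates if splits_evenly(bs, [128, 63])]

print("driver max 64K:", good_64)
print("driver max 63K:", good_63)
```

This only rules out sizes that produce runts; which of the survivors is
actually fastest still has to be measured with dd.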

For cases other than writes by dd, the bugs cause pessimal splitting.
E.g., file system clustering uses yet another bogusly initialized max
i/o size, vp->v_mount->mnt_iosize_max. This defaults to DFLTPHYS =
64K in the top vfs layer, but many file systems, including ffs, set
it to devvp->v_rdev->si_iosize_max, so it is normally set to the wrong
default set for the latter by GEOM, MAXPHYS = 128K. This normally
causes excessive splitting which is especially harmful if the driver's
max is not a divisor of MAXPHYS. E.g., when the driver's max is 63K,
writing a 256KB file to an ffs file system with the default fs-block
size of 16K causes the following bogus splitting even if ffs allocates
all the blocks optimally (contiguously):

At ffs level:
12*16K (direct data blocks)
1*16K (indirect block; but ffs usually gets this wrong and doesn't
allocate it contiguously)
4*16K (data blocks indirected through the indirect block)

At clustering level:
17*16K reblocked to 2*128K + 1*16K

At device driver level:
2*128K + 1*16K split into 63K, 63K, 2K, 63K, 63K, 2K, 16K

So splitting almost half undoes the gathering done by the clustering
level (we start with 17 blocks and end with 7). Ideally we would end
with 5 (4*63K + 1*20K).
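
The block counts in this example can be checked with a short Python
sketch. It assumes the best case described above (all 17 blocks
contiguous, greedy splitting at each level); the constant and function
names are only for illustration.

```python
FS_BLOCK_KB = 16      # default ffs block size
CLUSTER_MAX_KB = 128  # mnt_iosize_max as inherited from GEOM's default
DRIVER_MAX_KB = 63    # driver max in this example


def split(size_kb, max_kb):
    """Cut one request into pieces of at most max_kb."""
    full, rest = divmod(size_kb, max_kb)
    return [max_kb] * full + ([rest] if rest else [])


# 12 direct blocks + 1 indirect block + 4 blocks behind the indirect block.
ffs_blocks = [FS_BLOCK_KB] * (12 + 1 + 4)
total_kb = sum(ffs_blocks)

# Clustering gathers the contiguous blocks up to its own max.
clustered = split(total_kb, CLUSTER_MAX_KB)

# The driver level then splits each cluster at the driver max.
at_driver = [p for cluster in clustered for p in split(cluster, DRIVER_MAX_KB)]

# What a single splitting step at the driver max would give.
ideal = split(total_kb, DRIVER_MAX_KB)

print(len(ffs_blocks), "ffs blocks,", total_kb, "KB in total")
print("clustered:", clustered)
print("at driver:", at_driver)
print("ideal:    ", ideal)
```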

Caching in not-very-old drives (but not ZIP or CD/DVD ones) makes
stupid blocking not very harmful for reads, but doesn't help so much
for writes.

Bruce

Patrick Proniewski

Oct 5, 2005, 1:30:24 PM10/5/05
to freebsd-p...@freebsd.org
Hi,

thank you all for these interesting explanations.

I've run some more tests with my disks.
As you'll see, for block sizes greater than 64k, the HDD ad6 (Hitachi)
is the bottleneck.
A bs of 1m or 512k yields the best transfer rates between ad4 and ad6,
and using a pipe between two dd processes lowers the performance.

best regards, and thank you again,

Pat,

#### /dev/zero to ad6

# dd if=/dev/zero of=/dev/ad6 bs=1m count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 31.047655 secs (33773114 bytes/sec)

# dd if=/dev/zero of=/dev/ad6 bs=8k count=128000
128000+0 records in
128000+0 records out
1048576000 bytes transferred in 31.580223 secs (33203565 bytes/sec)

#### ad4 (SATA150) to ad6 (SATA150)

# dd if=/dev/ad4 of=/dev/ad6 bs=8k count=128000
128000+0 records in
128000+0 records out
1048576000 bytes transferred in 50.916216 secs (20594146 bytes/sec)

# dd if=/dev/ad4 of=/dev/ad6 bs=64k count=16000
16000+0 records in
16000+0 records out
1048576000 bytes transferred in 30.925397 secs (33906630 bytes/sec)

# dd if=/dev/ad4 of=/dev/ad6 bs=128k count=8000
8000+0 records in
8000+0 records out
1048576000 bytes transferred in 31.462153 secs (33328170 bytes/sec)

# dd if=/dev/ad4 of=/dev/ad6 bs=256k count=4000
4000+0 records in
4000+0 records out
1048576000 bytes transferred in 30.819234 secs (34023428 bytes/sec)

# dd if=/dev/ad4 of=/dev/ad6 bs=512k count=2000
2000+0 records in
2000+0 records out
1048576000 bytes transferred in 30.589651 secs (34278783 bytes/sec)

# dd if=/dev/ad4 of=/dev/ad6 bs=1m count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 30.660553 secs (34199514 bytes/sec)

# dd if=/dev/ad4 bs=1m count=1000 | dd of=/dev/ad6 bs=1m
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 33.998716 secs (30841635 bytes/sec)
0+16000 records in
0+16000 records out
1048576000 bytes transferred in 34.001099 secs (30839474 bytes/sec)
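
For easier comparison, the elapsed times above can be turned into rates
with a few lines of Python. The byte count and times are copied from the
dd output; the run labels are mine, and the pipe entry uses the time
reported by the reading dd.

```python
TOTAL_BYTES = 1048576000  # every run above moved the same amount

# Elapsed seconds as printed by dd.
runs = {
    "zero->ad6 bs=1m": 31.047655,
    "zero->ad6 bs=8k": 31.580223,
    "ad4->ad6 bs=8k": 50.916216,
    "ad4->ad6 bs=64k": 30.925397,
    "ad4->ad6 bs=128k": 31.462153,
    "ad4->ad6 bs=256k": 30.819234,
    "ad4->ad6 bs=512k": 30.589651,
    "ad4->ad6 bs=1m": 30.660553,
    "ad4 | ad6 bs=1m (pipe)": 33.998716,
}

rates = {name: TOTAL_BYTES / secs for name, secs in runs.items()}
best = max(rates, key=rates.get)

for name, rate in rates.items():
    share = 100.0 * rate / rates[best]
    print(f"{name:24s} {rate / 1e6:6.2f} MB/s  {share:5.1f}% of best")
```

Here MB means 10^6 bytes, the same unit dd's bytes/sec figure uses once
divided by a million.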


Willem Jan Withagen

Oct 12, 2005, 8:14:11 PM10/12/05
to Bruce Evans, freebsd-p...@freebsd.org, Tulio Guimarães da Silva
Bruce Evans wrote:

> On Mon, 3 Oct 2005, Tulio Guimarães da Silva wrote:
>
>> But just to clear out some questions...
>> 1) Maxtor's full specifications for Diamond Max+ 9 Series refers to
>> maximum *sustained* transfer rates of 37MB/s and 67MB/s for "ID" and
>> "OD", respectively (though I couldn't find exactly what it means, I
>> deduced that represents the rates for center- and border-parts of the
>> disk - please correct me if I'm wrong), then your tests show you're
>> getting the best out of it ;) ;
>> much slower.
>
>
> Another interesting point is that you can often get closer to the maximum
> rate than the average of the maximum and minimum rate. The outer tracks
> contain more sectors (about 67/37 times as many with the above spec), so
> the average rate over all sectors is larger than average of the max and
> min, significantly so since 67/37 is a fairly large fraction. Also, you can
> often partition disks to put less-often accessed stuff in the slow parts.
>
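
To put a rough number on the point quoted above, here is a small Python
sketch. The 37 and 67 MB/s figures are from the quoted spec; the rest is
my own simplification, not anything stated in the thread: constant
rotation speed, tracks evenly spaced in radius, and a rate that grows
linearly from the innermost to the outermost track. Under those
assumptions a track's sector count is proportional to its rate, so the
average over sectors weights the fast tracks more heavily.

```python
V_MIN, V_MAX = 37.0, 67.0  # MB/s at inner ("ID") and outer ("OD") tracks


def mean_over_tracks(v_min, v_max):
    """Plain average of min and max: every track counts equally."""
    return (v_min + v_max) / 2


def mean_over_sectors(v_min, v_max):
    """Average rate when each track is weighted by its sector count.

    With sectors per track proportional to the rate v, this is
    integral(v * v dv) / integral(v dv) over [v_min, v_max].
    """
    numerator = (v_max ** 3 - v_min ** 3) / 3
    denominator = (v_max ** 2 - v_min ** 2) / 2
    return numerator / denominator


print(f"average of min and max: {mean_over_tracks(V_MIN, V_MAX):.1f} MB/s")
print(f"average over sectors:   {mean_over_sectors(V_MIN, V_MAX):.1f} MB/s")
```

In this simple model the sector-weighted average comes out a little
above the midpoint of the two rates; a real drive's zoning will differ.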

[All GEOM aligning deleted]

As it happens, I again have some (faster) spare servers in my office.
Given the NFS tests of last year, I want to see if I can run those
tests again. But before doing so I wanted to verify the extent of what Bruce
suggests above (which I first found in an article some time ago).

I've written a small, not yet complete page on the topic. Currently it only
covers writing to the disk, but it clearly visualises the effect of
non-constant transfer rates, which depend on the location of the
track being accessed.

If you want, you could see for yourself at:
http://withagen.dyndns.org/FreeBSD/Performance/Raw-disk/
Suggestions etc. are welcome.

--WjW

aleksande...@gmail.com

Oct 16, 2005, 2:32:01 AM10/16/05
to
> Is it normal that data rate won't go upper than 35/38 MB/s ?
I think this is normal.
Just to show what you can get using the fastest SATA drives in a RAID0 array:

freehost# dd if=/dev/ar0 of=/dev/null bs=1m
^C81193+0 records in
81193+0 records out
85137031168 bytes transferred in 800.223549 secs (106391559 bytes/sec)
.......
Oct 13 22:14:19 freehost kernel: atapci1: <Intel ICH7 SATA150 controller> port 0x20c8-0x20cf,0x20ec-0x20ef,0x20c0-0x20c7,0x20e8-0x20eb,0x20a0-0x20af mem 0x501c4000-0x501c43ff irq 19 at device 31.2 on pci0
Oct 13 22:14:19 freehost kernel: ata2: <ATA channel 0> on atapci1
Oct 13 22:14:19 freehost kernel: ata3: <ATA channel 1> on atapci1
Oct 13 22:14:19 freehost kernel: ata4: <ATA channel 2> on atapci1
Oct 13 22:14:19 freehost kernel: ata5: <ATA channel 3> on atapci1
.....
Oct 13 22:14:19 freehost kernel: ad4: 70911MB <WDC WD740GD-00FLC0 33.08F33> at ata2-master SATA150
Oct 13 22:14:19 freehost kernel: ad4: Intel calc=e4ea2ca2 meta=f2751651
Oct 13 22:14:19 freehost kernel: ad6: 70911MB <WDC WD740GD-00FLC0 33.08F33> at ata3-master SATA150
Oct 13 22:14:19 freehost kernel: ad6: Intel calc=e4ea2ca2 meta=f2751651
Oct 13 22:14:19 freehost kernel: ar0: 141817MB <Intel MatrixRAID RAID0 (stripe 128 KB)> status: READY
Oct 13 22:14:19 freehost kernel: ar0: disk0 READY using ad4 at ata2-master
Oct 13 22:14:19 freehost kernel: ar0: disk1 READY using ad6 at ata3-master
....
# uname -a
FreeBSD freehost.tes.local 6.0-RC1 FreeBSD 6.0-RC1 #1: Thu Oct 13 17:38:05 YEKST 2005 root@:/usr/obj/usr/src/sys/SERVER6 i386
