When writing to a logical volume (/dev/sys/test) directly through the
device node, I get poor performance:
root@dom0-2:/dev/mapper# dd of=/dev/sys/test if=/dev/zero
4580305+0 records in
4580305+0 records out
2345116160 bytes (2.3 GB) copied, 119.327 s, 19.7 MB/s
Making a file system on top of the LV, mounting it, and writing into a
file is fine:
root@dom0-2:/dev/mapper# mkfs.xfs /dev/sys/test
root@dom0-2:/mnt# mount /dev/sys/test /mnt/lv
root@dom0-2:/mnt# dd of=/mnt/lv/out if=/dev/zero
2647510+0 records in
2647510+0 records out
1355525120 bytes (1.4 GB) copied, 11.3235 s, 120 MB/s
Furthermore, by accident I noticed that writing directly to the block
device is fine while the LV is mounted (this of course destroys the file
system on it):
root@dom0-2:/mnt# dd of=/dev/sys/test if=/dev/zero
3703375+0 records in
3703374+0 records out
1896127488 bytes (1.9 GB) copied, 15.4927 s, 122 MB/s
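For anyone wanting to probe this without risking a device, the same write can be pointed at a scratch file with an explicit block size and a forced flush (a sketch; the sizes are arbitrary, and in the real test the target was /dev/sys/test):

```shell
# Write 64 MiB to a scratch file twice: once with dd's default 512-byte
# blocks, once with 4 KiB blocks. conv=fsync forces the data to disk so
# the page cache cannot hide the difference between the two runs.
target=$(mktemp)

dd if=/dev/zero of="$target" bs=512  count=131072 conv=fsync 2>&1 | tail -n 1
dd if=/dev/zero of="$target" bs=4096 count=16384  conv=fsync 2>&1 | tail -n 1

size=$(stat -c %s "$target")   # both runs write the same 67108864 bytes
rm -f "$target"
```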
Does anyone know what is going on?
The configuration is as follows:
Debian 6.0.2
Kernel 2.6.32-5-xen-amd64
Tests are on a partition on one physical disk
Best regards,
Dion Kant
--
To UNSUBSCRIBE, email to debian-us...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listm...@lists.debian.org
Archive: http://lists.debian.org/4E3F8173...@concero.nl
Yes. You lack knowledge of the Linux storage stack and of the dd
utility. Your system is fine. You are simply running an improper test,
and interpreting the results from that test incorrectly.
Google for more information on the "slow" results you are seeing.
--
Stan
Apparently the Debian kernel behaves differently with respect to this
"issue" than, for example, an openSUSE kernel, which does give symmetric
(near disk-I/O-limited) results.
What is the proper way to copy a (large) raw disk image onto a logical
volume?
Thanks for your advice to try Google. I already found a couple of posts
from people describing a similar issue, but no proper explanation yet.
Dion.
Apparently you are Google challenged as well. Here:
http://lmgtfy.com/?q=lvm+block+size
> What is the proper way to copy a (large) raw disk image onto a logical
> volume?
See above, and do additional research into dd and "block size". It also
wouldn't hurt for you to actually read and understand the dd man page.
> Thanks for your advise to try Google. I already found a couple of posts
> from people describing this similar issue, but no proper explanation yet.
I already knew the answer, so maybe my search criteria are what allowed
me to "find" the answer for you in 20 seconds or less. I hate spoon
feeding people, as spoon feeding is antithetical to learning and
remembering. Hopefully you'll learn something from this thread, and
remember it. :)
--
Stan
BTW, you didn't mention which disk drive is in use in this test. Is it
an Advanced Format drive? If so, and your partitions are unaligned,
this in combination with no dd block size being specified would cause
the 10x drop in your dd "test". The wrong block size alone shouldn't
yield a 10x drop, more like 3-4x. Please state the model # of the disk
drive, and the partition table, using:
/# hdparm -I /dev/sdX
/# fdisk -l /dev/sdX
Lemme guess, this is one of those POS cheap WD Green drives, isn't it?
Just in case, read this too:
This document applies to *all* Advanced Format drives, not strictly
those sold by Western Digital.
--
Stan
Thanks for your remarks. The disk info is given below. Writing to the
disk is fine when mounted, so I think it is not a hardware/alignment
issue. However, your remarks made me do some additional investigation:
1. dd of=/dev/sdb4 if=/dev/zero gives similar results, so it has nothing
to do with LVM;
2. My statement about writing like this on an openSUSE kernel is wrong.
Also with openSUSE and the same hardware I get similarly slow results
when writing to the disk with dd via the device file.
So now the issue has shifted to the asymmetric behaviour when
writing/reading with dd directly through the (block) device file.
Reading with dd if=/dev/sdb4 of=/dev/null gives disk limited performance
Writing with dd of=/dev/sdb4 if=/dev/zero gives about a factor 10 less
performance.
However, after mounting a file system on sdb4 (read only), I can use dd
of=/dev/sdb4 if=/dev/zero at (near) disk-limited performance.
I used this trick to copy a large (raw) disk image onto an LVM
partition. I think this is odd. Can somebody explain why it behaves like
this?
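For the record, the plain way to copy a raw image onto an LV is dd with a large, 4 KiB-aligned block size and a final flush. A sketch, with scratch files standing in for the image and the volume:

```shell
# image and lv are placeholders; in the real case the source would be
# the raw disk image and the target the LV (e.g. /dev/sys/test).
image=$(mktemp)
lv=$(mktemp)
dd if=/dev/zero of="$image" bs=4096 count=256 2>/dev/null   # 1 MiB dummy image

# A large block size keeps the transfer efficient; conv=fsync flushes
# the copy to stable storage before dd exits.
dd if="$image" of="$lv" bs=4M conv=fsync 2>/dev/null

cmp -s "$image" "$lv" && result=verified
echo "$result"
rm -f "$image" "$lv"
```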
Here is the disk info:
Model Family: Seagate Barracuda ES
Device Model: ST3750640NS
root@dom0-2:~# fdisk -l /dev/sdb
Disk /dev/sdb: 750.2 GB, 750156374016 bytes
255 heads, 63 sectors/track, 91201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000eae95
Device Boot Start End Blocks Id System
/dev/sdb1 1 244 1951744 fd Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdb2 244 280 292864 fd Linux raid autodetect
Partition 2 does not end on cylinder boundary.
/dev/sdb3 280 7575 58593280 fd Linux raid autodetect
Partition 3 does not end on cylinder boundary.
/dev/sdb4 7575 91202 671734784 fd Linux raid autodetect
Partition 4 does not end on cylinder boundary.
root@dom0-2:~# hdparm -I /dev/sdb
/dev/sdb:
ATA device, with non-removable media
Model Number: ST3750640NS
Serial Number: 5QD193MQ
Firmware Revision: 3.AEK
Standards:
Supported: 7 6 5 4
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 1465149168
Logical Sector size: 512 bytes
Physical Sector size: 512 bytes
device size with M = 1024*1024: 715404 MBytes
device size with M = 1000*1000: 750156 MBytes (750 GB)
cache/buffer size = 16384 KBytes
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = ?
Advanced power management level: 254
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=240ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* DOWNLOAD_MICROCODE
* Advanced Power Management feature set
SET_MAX security extension
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
64-bit World wide name
Time Limited Commands (TLC) feature set
Command Completion Time Limit (CCTL)
* Gen1 signaling speed (1.5Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
Device-initiated interface power management
* Software settings preservation
* SMART Command Transport (SCT) feature set
* SCT LBA Segment Access (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
not supported: enhanced erase
Logical Unit WWN Device Identifier: 0000000000000000
NAA : 0
IEEE OUI : 000000
Unique ID : 000000000
Checksum: correct
Dion
> Thanks for your remarks. The disk info is given below. Writing to the
> disk is oke when mounted, so I think it is not a hardware/alignment
> issue. However your remarks made me do some additional investigations:
>
> 1. dd of=/dev/sdb4 if=/dev/zero gives similar results, so it has nothing
> to do with LVM;
> 2. My statement about writing like this on an openSUSE kernel is wrong.
> Also with openSUSE and the same hardware I get similar (slow) results
> when writing to the disk using dd via the device file.
>
> So now the issue has diverted to the asymmetric behaviour when
> writing/reading using dd directly through the (block) device file.
>
> Reading with dd if=/dev/sdb4 of=/dev/null gives disk limited performance
> Writing with dd of=/dev/sdb4 if=/dev/zero gives about a factor 10 less
> performance.
Run:
/$ dd of=/dev/sdb4 if=/dev/zero bs=4096 count=500000
Then run again with bs=512 count=2000000
That will write 2GB in 4KB blocks and will prevent dd from trying to
buffer everything before writing it. You don't break out of this--it
finishes on its own due to 'count'. The second run will use a block
size of 512B, which is the native sector size of the Seagate disk.
Either of these should improve your actual dd performance dramatically.
When you don't specify a block size with dd, dd attempts to "buffer" the
entire input stream, or huge portions of it, into memory before writing
it out. If you look at RAM, swap usage, and disk IO while running your
'raw' dd test, you'll likely see both memory, and IO to the swap device,
are saturated, with little actual data being written to the target disk
partition.
I attempted to nudge you into finding this information on your own, but
you apparently did not. I explained all of this not long ago, either
here or on the linux-raid list. It should be in Google somewhere.
Never use dd without specifying the proper block size of the target
device--never. For a Linux filesystem this will be 4096 and for a raw
hard disk device it will be 512, optimally anyway. Other values may
give better performance, depending on the system, the disk controller,
and device driver, etc.
That Seagate isn't an AF model so sector alignment isn't the issue here,
just improper use of dd.
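The effect of the block size is easy to see for yourself with a loop against a scratch file (a sketch; 32 MiB total keeps the run short):

```shell
# Sweep dd block sizes; conv=fsync makes each run wait until the data
# has reached the disk, so the reported rates are comparable.
target=$(mktemp)
total=$((32 * 1024 * 1024))

for bs in 512 1024 4096 16384; do
    count=$((total / bs))
    printf 'bs=%-6s ' "$bs"
    dd if=/dev/zero of="$target" bs="$bs" count="$count" conv=fsync 2>&1 | tail -n 1
done
rm -f "$target"
```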
--
Stan
You are right: with bs=4096 the write performance improves
significantly. From the dd man page I concluded that not specifying
bs selects ibs=512 and obs=512. Indeed, bs=512 gives performance
similar to not specifying bs at all.
When observing the system with vmstat I see the same (strange) behaviour
for no bs specified, or bs=512:
root@dom0-2:~# vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 0 6314620 125988 91612 0 0 0 3 5 5 0 0 100 0
1 1 0 6265404 173744 91444 0 0 23868 13 18020 12290 0 0 86 14
2 1 0 6214576 223076 91704 0 0 24666 1 18596 12417 0 0 90 10
0 1 0 6163004 273172 91448 0 0 25046 0 18867 12614 0 0 89 11
1 0 0 6111308 323252 91592 0 0 25042 0 18861 12608 0 0 92 8
0 1 0 6059860 373220 91648 0 0 24984 0 18821 12578 0 0 85 14
0 1 0 6008164 423304 91508 0 0 25040 0 18863 12611 0 0 95 5
2 1 0 5956344 473468 91604 0 0 25084 0 18953 12630 0 0 95 5
0 1 0 5904896 523548 91532 0 0 25038 0 18867 12607 0 0 87 13
0 1 0 5896068 528680 91520 0 0 2558 99597 2431 1373 0 0 92 8
0 2 0 5896088 528688 91520 0 0 0 73736 535 100 0 0 86 13
0 1 0 5896128 528688 91520 0 0 0 73729 545 99 0 0 88 12
1 0 0 6413920 28712 91612 0 0 54 2996 634 372 0 0 95 4
0 0 0 6413940 28712 91520 0 0 0 0 78 80 0 0 100 0
0 0 0 6413940 28712 91520 0 0 0 0 94 97 0 0 100 0
Remarkable behaviour, in the sense that there is a lot of bi (block
input) at the beginning, and only at the end do I see bo at 75 MB/s.
With obs=4096 it looks like this:
root@dom0-2:~# vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 0 6413600 28744 91540 0 0 0 3 5 5 0 0 100 0
1 0 0 6413724 28744 91540 0 0 0 0 103 96 0 0 100 0
1 0 0 6121616 312880 91208 0 0 0 18 457 133 1 2 97 0
0 1 0 5895588 528756 91540 0 0 0 83216 587 88 1 3 90 6
0 1 0 5895456 528756 91540 0 0 0 73728 539 98 0 0 92 8
0 3 0 5895400 528760 91536 0 0 0 73735 535 93 0 0 86 14
1 0 0 6413520 28788 91436 0 0 54 19359 783 376 0 0 93 6
0 0 0 6413544 28788 91540 0 0 0 2 100 84 0 0 100 0
0 0 0 6413544 28788 91540 0 0 0 0 86 87 0 0 100 0
0 0 0 6413552 28796 91532 0 0 0 10 110 113 0 0 100 0
As soon as I select a bs that is not a whole multiple of 4096, I get a
lot of block input and bad performance when writing data to disk.
I'll try to Google the thread(s) you mentioned. I still don't feel very
satisfied with your explanation, though.
Thanks so far,
Dion
That might be due to massive merges, but I'm not really a kernel hacker
so I can't say for sure.
My explanation to you wasn't fully correct. I confused specifying no
block size with specifying an insanely large block size. The other post
I was referring to dealt with people using a 1GB (or larger) block size
because it made the math easier for them when wanting to write a large
test file.
Instead of dividing their total file size by 4096 and using the result
for "bs=4096 count=X" (which is the proper method I described to you)
they were simply specifying, for example, "bs=2G count=1" to write a 2
GB test file. Doing this causes the massive buffering I described, and
consequently, horrible performance, typically by a factor of 10 or more,
depending on the specific system.
The horrible performance with bs=512 is likely due to the LVM block size
being 4096, and forcing block writes that are 1/8th normal size, causing
lots of merging. If you divide 120MB/s by 8 you get 15MB/s, which IIRC
from your original post, is approximately the write performance you were
seeing, which was 19MB/s.
If my explanation doesn't seem thorough enough, that's because I'm not a
kernel expert. I just have slightly better than average knowledge and
understanding of some aspects of the kernel.
If you want a really good explanation of the reasons behind this dd
block size behavior while writing to a raw LVM device, try posting to
lkml proper or one of the sub lists dealing with LVM and the block
layer. Also, I'm sure some of the expert developers on the XFS list
could answer this as well, though it would be a little OT there, unless
of course your filesystem test yielding the 120MB/s was using XFS. ;)
--
Stan
[…]
> The horrible performance with bs=512 is likely due to the LVM block
> size being 4096, and forcing block writes that are 1/8th normal size,
> causing lots of merging. If you divide 120MB/s by 8 you get 15MB/s,
> which IIRC from your original post, is approximately the write
> performance you were seeing, which was 19MB/s.
I'm not an expert in that matter either, but I don't seem to
recall that LVM uses any “blocks”, other than, of course, the
LVM “extents.”
What's more important in my opinion is that 4096 is exactly the
platform's page size.
--cut: vgcreate(8) --
-s, --physicalextentsize PhysicalExtentSize[kKmMgGtT]
Sets the physical extent size on physical volumes of this volume
group. A size suffix (k for kilobytes up to t for terabytes) is
optional, megabytes is the default if no suffix is present. The
default is 4 MB and it must be at least 1 KB and a power of 2.
--cut: vgcreate(8) --
[…]
--
FSF associate member #7257
To use a water analogy, an extent is a pool used for storing data. It
has zero to do with transferring the payload. A block is a bucket used
to carry data to and from the pool.
If one fills his bucket only 1/8th full, it will take 8 times as many
trips (transfers) to fill the pool vs carrying a full bucket each time.
This is inefficient. This is a factor in the OP's problem. This is a
very coarse analogy, and maybe not the best, but gets the overall point
across.
The LVM block (bucket) size is 4kB, which yes, does match the page size,
which is important. It also matches the default filesystem block size
of all Linux filesystems. This is not coincidence. Everything in Linux
is optimized around a 4kB page size, whether memory management or IO.
And to drive the point home that this isn't an LVM or RAID problem, but
a proper use of dd problem, here's a demonstration of the phenomenon on
a single low end internal 7.2k SATA disk w/16MB cache, with a partition
formatted with XFS, write barriers enabled:
$ dd if=/dev/zero of=./test1 bs=512 count=1000000
512000000 bytes (512 MB) copied, 16.2892 s, 31.4 MB/s
$ dd if=/dev/zero of=./test1 bs=1024 count=500000
512000000 bytes (512 MB) copied, 10.5173 s, 48.7 MB/s
$ dd if=/dev/zero of=./test1 bs=2048 count=250000
512000000 bytes (512 MB) copied, 7.77854 s, 65.8 MB/s
$ dd if=/dev/zero of=./test1 bs=4096 count=125000
512000000 bytes (512 MB) copied, 6.64778 s, 77.0 MB/s
$ dd if=/dev/zero of=./test1 bs=8192 count=62500
512000000 bytes (512 MB) copied, 6.10967 s, 83.8 MB/s
$ dd if=/dev/zero of=./test1 bs=16384 count=31250
512000000 bytes (512 MB) copied, 6.11042 s, 83.8 MB/s
This test system is rather old, having only 384MB RAM. I tested with
and without conv=fsync and the results are the same. This clearly
demonstrates that one should always use a 4kB block size with dd, WRT
HDDs and SSDs, LVM or mdraid, or hardware RAID. Floppy drives, tape,
and other slower devices probably need a different dd block size.
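The role of the page cache in such numbers can be made explicit: without a flush, dd reports the speed of memory; with conv=fsync it reports the disk. A sketch against a scratch file:

```shell
# The first run returns as soon as the data sits in the page cache;
# the second waits until the data is actually on stable storage.
target=$(mktemp)

dd if=/dev/zero of="$target" bs=4096 count=8192 2>&1            | tail -n 1
dd if=/dev/zero of="$target" bs=4096 count=8192 conv=fsync 2>&1 | tail -n 1

size=$(stat -c %s "$target")   # 8192 * 4096 = 33554432 bytes either way
rm -f "$target"
```

On a box with plenty of free RAM the first figure can far exceed what the drive can sustain; the second is the honest one.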
--
Stan
> Instead of dividing their total file size by 4096 and using the result
> for "bs=4096 count=X" (which is the proper method I described to you)
> they were simply specifying, for example, "bs=2G count=1" to write a 2
> GB test file. Doing this causes the massive buffering I described, and
> consequently, horrible performance, typically by a factor of 10 or more,
> depending on the specific system.
>
> The horrible performance with bs=512 is likely due to the LVM block size
> being 4096, and forcing block writes that are 1/8th normal size, causing
> lots of merging. If you divide 120MB/s by 8 you get 15MB/s, which IIRC
> from your original post, is approximately the write performance you were
> seeing, which was 19MB/s.
Recall that I took LVM out of the loop already. So now I am doing the
experiment with writing data straight to the block device. In my case
/dev/sdb4. (If writing on the block device level does not perform, how
will LVM be able to perform?)
Inspired by your advice, I did some more investigation. I wrote a
small test program, i.e. took dd out of the loop as well. It writes
1 GB of test data with increasing block sizes directly to /dev/sdb4.
Here are some results:
root@dom0-2:~# ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
1 85.476 12.5619
2 33.016 32.5218
4 23.6675 45.3679
8 20.112 53.3881
16 18.76 57.2356
32 17.872 60.0795
64 17.636 60.8834
128 17.096 62.8064
256 17.188 62.4704
512 16.8482 63.7303
1024 57.6053 18.6396
2048 57.94 18.532
4096 17.016 63.1019
8192 16.604 64.6675
16384 16.452 65.2649
32768 17.132 62.6748
65536 16.256 66.052
131072 16.44 65.3127
262144 16.264 66.0194
524288 16.388 65.5199
The good and problematic block sizes do not really coincide with the
ones I observe with dd, but the odd behaviour is there. There are some
magic block sizes {1,1024, 2048} which cause a drop in performance.
Looking at vmstat output at the same time I see unexpected bi and the
interrupt rate goes sky high.
In my case it is the ahci driver handling the writes. Here is the vmstat
trace belonging to the bs=1 write and I add some more observations below:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 6379780 23820 112616 0 0 0 0 78 82 0 0 100 0
0 0 0 6379780 23820 112616 0 0 0 0 77 80 0 0 100 0
0 0 0 6379780 23820 112616 0 0 0 0 79 80 0 0 100 0
0 0 0 6379780 23820 112616 0 0 0 0 78 82 0 0 100 0
0 0 0 6379780 23820 112616 0 0 0 0 76 80 0 0 100 0
0 0 0 6379780 23820 112616 0 0 0 0 77 83 0 0 100 0
0 0 0 6379780 23820 112616 0 0 0 0 75 80 0 0 100 0
0 0 0 6379780 23820 112616 0 0 0 0 82 82 0 0 100 0
0 0 0 6379780 23820 112616 0 0 0 0 90 93 0 0 100 0
1 0 0 6376796 27132 112524 0 0 828 4 400 531 0 0 100 0
1 0 0 6346416 57496 112560 0 0 7590 0 2408 3877 5 0 92 3
1 0 0 6315788 88048 112580 0 0 7638 0 2435 3903 7 0 90 3
1 1 0 6284416 118548 112540 0 0 7624 0 2428 3903 6 0 91 3
1 0 0 6253168 148896 112564 0 0 7586 0 2403 3875 6 0 91 3
1 0 0 6221920 179284 112484 0 0 7596 0 2414 3884 5 0 93 2
1 0 0 6190672 209648 112540 0 0 7590 0 2417 3877 5 0 93 2
0 1 0 6160540 239796 112536 0 0 7540 0 6240 3851 6 0 76 18
0 1 0 6129540 269952 112584 0 0 7538 0 6255 3856 6 0 86 8
1 0 0 6098292 300116 112504 0 0 7540 0 6233 3853 5 0 89 6
1 0 0 6067540 330280 112552 0 0 7538 0 6196 3857 6 0 87 7
1 0 0 6036540 360452 112536 0 0 7542 0 6281 3868 5 0 89 6
1 0 0 6005540 390608 112464 0 0 7540 0 6268 3856 6 0 85 8
1 0 0 5974292 420788 112516 0 0 7542 0 6246 3865 6 0 86 7
1 0 0 5943416 450952 112444 0 0 7540 0 6253 3860 5 0 88 6
1 0 0 5912540 481128 112488 0 0 7546 0 6226 3861 6 0 86 7
1 0 0 5881292 511300 112472 0 0 7540 0 6225 3860 5 0 89 6
1 0 0 5850292 541456 112464 0 0 7538 0 6192 3858 6 0 86 7
0 2 0 5817664 570260 112516 0 0 7200 40706 5990 4820 6 0 81 13
0 2 0 5789268 597752 112472 0 0 6870 0 5775 5251 5 0 80 15
1 1 0 5760996 625164 112676 0 0 6854 8192 5795 5248 5 0 73 21
1 1 0 5732476 653232 112572 0 0 7014 8192 5285 5362 5 0 82 13
1 1 0 5704080 680924 112676 0 0 6922 0 2340 5290 3 0 92 5
1 1 0 5674504 709444 112540 0 0 7130 8192 2404 5469 5 0 71 24
1 1 0 5646184 737144 112484 0 0 6924 0 2320 5293 5 0 85 10
1 1 0 5617460 765004 112484 0 0 6966 8192 5844 5329 5 0 75 20
2 2 0 5588264 793288 112500 0 0 7068 8192 5313 5404 4 0 85 11
1 1 0 5559556 821084 112628 0 0 6948 0 2326 5309 8 0 78 14
1 1 0 5530468 849304 112476 0 0 7054 8192 2374 5395 5 0 75 20
1 1 0 5501892 876956 112464 0 0 6912 8192 2321 5285 5 0 85 10
0 2 0 5472936 905044 112584 0 0 7024 0 5889 5370 5 0 70 25
0 2 0 5444476 933096 112596 0 0 7010 8192 5874 5360 4 0 82 13
0 2 0 5415520 960924 112476 0 0 6960 0 5841 5323 6 0 70 24
1 1 0 5386580 989096 112696 0 0 7038 8192 5282 5384 6 0 69 25
2 2 0 5357624 1017164 112688 0 0 7016 0 2358 5362 4 0 89 7
1 1 0 5328428 1045280 112580 0 0 7028 8192 2356 5379 5 0 80 15
0 2 0 5296688 1072396 112540 0 0 6778 50068 2314 5194 0 0 99 1
0 2 0 5297044 1072396 112616 0 0 0 64520 317 176 0 0 75 24
0 2 0 5297044 1072396 112616 0 0 0 64520 310 175 0 0 77 23
0 2 0 5297044 1072396 112616 0 0 0 64520 300 161 0 0 85 15
0 2 0 5297052 1072396 112616 0 0 0 72204 317 180 0 0 77 22
0 2 0 5297052 1072396 112616 0 0 0 64520 307 170 0 0 84 16
0 1 0 5300540 1072396 112616 0 0 0 21310 309 203 0 0 98 2
1 0 0 6351440 52252 112680 0 0 54 25 688 343 1 1 63 35
1 0 0 6269720 133036 112600 0 0 0 0 575 88 7 0 93 0
1 0 0 6186516 213812 112560 0 0 0 0 568 83 9 0 91 0
1 0 0 6103560 294588 112512 0 0 0 0 569 85 6 0 94 0
1 0 0 6020852 375428 112688 0 0 0 0 571 84 9 0 90 0
1 0 0 5937896 456244 112664 0 0 0 0 571 86 7 0 93 0
Writing to /dev/sdb4 starts when there is a step in the interrupt
column. As long as the interrupt rate is high there is bi related to
this writing. After initial buffering there is a first write to the disk
at 40MB/s averaged over 2 seconds. Then only a couple of 8MB/s writes
follow, and in the meantime the (kernel) buffer grows up to 1072396 kB.
Then the driver starts writing at the expected rate and the interrupt
rate drops to a reasonable level. Only at the end of the write does the
ahci driver give back its buffer memory. After this, when the interrupt
rate settles at a level of about 570, the ahci driver is swallowing the
second write iteration, with a block size of 2 bytes.
Here is the code fragment responsible for writing and measuring. N is
the block size, buf an N-byte buffer, Ntot the total number of bytes,
and sdb4 is

ofstream* sdb4 = new ofstream("/dev/sdb4", ofstream::binary);

The timed section:

sync();
gettimeofday(&tstart, &tz);
for (int i=0; i<Ntot/N; ++i)
    sdb4->write(buf,N);
sdb4->flush();
sdb4->close();
sync();
gettimeofday(&tstop, &tz);
I think Stan is right that this may be something in the ahci kernel driver.
I have a 3ware controller lying around. I might repeat the
experiments with it and post the results here if someone is interested.
Dion
> If my explanation doesn't seem thorough enough that's because I'm not a
> kernel expert. I'm just have a little better than average knowledge/
> understanding of some of aspects of the kernel.
>
> If you want a really good explanation of the reasons behind this dd
> block size behavior while writing to a raw LVM device, try posting to
> lkml proper or one of the sub lists dealing with LVM and the block
> layer. Also, I'm sure some of the expert developers on the XFS list
> could answer this as well, though it would be a little OT there, unless
> of course your filesystem test yielding the 120MB/s was using XFS. ;)
>
> -- Stan
Now I obtain:
dom0-2:~ # ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
128 66.6928 16.0998
256 57.1125 18.8005
512 57.219 18.7655
1024 56.6571 18.9516
2048 55.5829 19.3179
4096 14.9638 71.7558
8192 15.6889 68.4395
16384 16.3382 65.7197
32768 15.2223 70.5372
65536 15.2356 70.4757
131072 15.2417 70.4474
262144 16.4634 65.2201
524288 15.2347 70.4802
The best result is obtained with Stan's golden rule bs=4096, and there
are a lot of interrupts whenever bs is not an integral multiple of 4096.
This time the program writes through the raw file descriptor:
int fd = open("/dev/sdb4", O_WRONLY | O_APPEND);
...
gettimeofday(&tstart, &tz);
for (int i=0; i<Ntot/N; ++i)
written+=write(fd, buf, N);
fsync(fd);
close(fd);
gettimeofday(&tstop, &tz);
Dion
Yep, it's really dramatic on machines with low memory due to swapping.
When I first tested this phenomenon with a 1GB dd block size on my
machine with only 384 MB RAM and a 1GB swap partition, it took many
minutes to complete, vs tens of seconds using a 4kB block size. Almost
all of the 2GB of test data was being pushed into swap, then read back
from swap and written to the file--swap and file on the same physical disk.
This is one of the reasons I keep this old machine around--problems of
this nature show up more quickly and are more easily identified.
The dips at 1024 & 2048 are strange, but not entirely unexpected.
> The good and problematic block sizes do not really coincide with the
> ones I observe with dd, but the odd behaviour is there. There are some
> magic block sizes {1,1024, 2048} which cause a drop in performance.
> Looking at vmstat output at the same time I see unexpected bi and the
> interrupt rate goes sky high.
>
> In my case it is the ahci driver handling the writes. Here is the vmstat
> trace belonging to the bs=1 write and I add some more observations below:
Yeah, every platform will have quirks.
What do you see when you insert large delays between iterations, or run
each iteration after clearing the baffles, i.e.
$ echo 3 > /proc/sys/vm/drop_caches
> Here is the code fragment responsible for writing and measuring:
>
> sync();
> gettimeofday(&tstart, &tz);
> for (int i=0; i<Ntot/N; ++i)
> sdb4->write(buf,N);
> sdb4->flush();
> sdb4->close();
> sync();
> gettimeofday(&tstop, &tz);
>
> N is the block size and sdb4 is
>
> ofstream* sdb4 = new ofstream("/dev/sdb4", ofstream::binary);
>
>
> I think Stan is right that this may be something in the ahci kernel driver.
Could be. Could just need tweaking, say queue_depth, elevator, etc.
Did you test with all 3 elevators or just one?
> I have some 3ware controller laying around. I might repeat the
> experiments with this and post them here if someone is interested.
If it doesn't have the hardware write cache enabled you will likely see
worse performance than with the current drive/controller.
--
Stan
User bs Actual bs
1 8191
2 8192
4 8192
8 8192
16 8192
32 8192
64 8192
128 8192
256 8192
512 8192
1024 1024
2048 2048
4096 4096
8192 8192
Except for writing single bytes, the C++ runtime library (libstdc++)
does a good job of gathering the data into buffers with an integral
buffer size of 8192 bytes. From a user bs of 1024 onward, it sticks to
this buffer size for writing the data to kernel space. That explains the
results I obtained with the write method of ofstream: in all cases where
the kernel is addressed with a buffer size that is an integral multiple
of 4096, the performance is good.
I think the one-too-small buffer size (8191) for the single-byte case is
an opportunity for improvement in the library.
Dion
I now think I understand the "strange" behaviour for block sizes that
are not an integral multiple of 4096 bytes. (Of course you guys already
knew the answer but just didn't want to make it easy for me to find.)
Newer disks have a sector size of 4096 bytes. They may still report
512 bytes, but only to keep some ancient OSes working.
When a block write is not an integral multiple of 4096 bytes, for
example 512, 4095, or 8191 bytes, the driver must first read the sector,
modify it, and finally write it back to the disk. This explains the bi
and the increased number of interrupts.
I did some Google searches but did not find much. Can someone confirm
this hypothesis?
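One way to check what the drive actually reports is to ask the kernel directly (a sketch; sdb is the drive from my tests, and the queue attributes are only there on reasonably recent kernels):

```shell
# Report the logical and physical sector size the kernel sees for a disk.
sector_sizes() {
    dev=$1
    for attr in logical_block_size physical_block_size; do
        f=/sys/block/$dev/queue/$attr
        if [ -r "$f" ]; then
            printf '%s %s: %s\n' "$dev" "$attr" "$(cat "$f")"
        else
            printf '%s %s: not available\n' "$dev" "$attr"
        fi
    done
}

# /dev/sdb is the drive from the tests above; adjust to taste.
sector_sizes sdb
# A 512e Advanced Format drive shows logical 512 / physical 4096;
# a drive showing 512/512 (like this Barracuda ES) has no RMW penalty
# at the media level.
```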
Best regards,
Dion
> I now think I understand the "strange" behaviour for block sizes not an
> integral multiple of 4096 bytes. (Of course you guys already knew the
> answer but just didn't want to make it easy for me to find the answer.)
>
> The newer disks today have a sector size of 4096 bytes. They may still
> be reporting 512 bytes, but this is to keep some ancient OS-es working.
>
> When a block write is not an integral of 4096 bytes, for example 512
> bytes, 4095 or 8191 bytes, the driver must first read the sector, modify
> it and finally write it back to the disk. This explains the bi and the
> increased number of interrupts.
>
> I did some Google searches but did not find much. Can someone confirm
> this hypothesis?
The read-modify-write performance penalty of unaligned partitions on the
"Advanced Format" drives (4KB native sectors) is a separate unrelated issue.
As I demonstrated earlier in this thread, the performance drop seen when
using dd with block sizes less than 4KB affects traditional 512B/sector
drives as well. If one has a misaligned partition on an Advanced Format
drive, one takes a double performance hit when dd bs is less than 4KB.
Again, everything in (x86) Linux is optimized around the 'magic' 4KB
size, including page size, filesystem block size, and LVM block size.
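A sketch of how to pin dd's block size explicitly instead of taking its
512-byte default. The scratch file is my stand-in so the commands are safe
to copy; on the setup discussed here the target would be the raw LV
(/dev/sys/test), ideally with oflag=direct so the page cache does not mask
the effect:

```shell
# Write 4 MiB with a sub-page and a page-aligned block size and show
# dd's summary line for each run.
scratch=$(mktemp)

for bs in 512 4096; do
    echo "bs=$bs:"
    dd if=/dev/zero of="$scratch" bs="$bs" count=$(( 4194304 / bs )) \
       conv=fsync 2>&1 | tail -n 1
done

rm -f "$scratch"
```

On a cached file write the two timings will be close; the gap only shows
up against the raw device.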
BTW, did you run your test with each of the elevators, as I recommended?
Do the following, testing dd after each change.
$ echo deadline > /sys/block/sdX/queue/scheduler
$ echo noop > /sys/block/sdX/queue/scheduler
$ echo cfq > /sys/block/sdX/queue/scheduler
Also, just for fun, and interesting results, increase your read_ahead_kb
from the default 128 to 512.
$ echo 512 > /sys/block/sdX/queue/read_ahead_kb
These changes are volatile so a reboot clears them in the event you're
unable to change them back to the defaults for any reason. This is
easily avoidable if you simply cat the files and write down the values
before changing them. After testing, echo the default values back in.
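Writing the values down works, but restoring the scheduler needs one extra
step: the file lists every elevator with the active one in brackets (e.g.
"noop [deadline] cfq"), so the brackets must be stripped before echoing the
name back. A sketch, using a mktemp directory as a stand-in for
/sys/block/sdX/queue so it is safe to run anywhere:

```shell
# Extract the active elevator from the bracketed scheduler listing.
active_elevator() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

queue=$(mktemp -d)            # stand-in for /sys/block/sdX/queue
echo 'noop [deadline] cfq' > "$queue/scheduler"
echo 128 > "$queue/read_ahead_kb"

# record the defaults before testing
saved_sched=$(active_elevator "$queue/scheduler")
saved_ra=$(cat "$queue/read_ahead_kb")
echo "saved: scheduler=$saved_sched read_ahead_kb=$saved_ra"

# ... switch elevators and run the dd tests here ...

# restore the defaults afterwards
echo "$saved_sched" > "$queue/scheduler"
echo "$saved_ra" > "$queue/read_ahead_kb"
```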
--
Stan
Archive: http://lists.debian.org/4E4EE96B...@hardwarefreak.com
> On 8/19/2011 4:38 PM, Dion Kant wrote:
>> I now think I understand the "strange" behaviour for block sizes not an
>> integral multiple of 4096 bytes. (Of course you guys already knew the
>> answer but just didn't want to make it easy for me to find the answer.)
>>
>> The newer disks today have a sector size of 4096 bytes. They may still
>> be reporting 512 bytes, but this is to keep some ancient OS-es working.
>>
>> When a block write is not an integral of 4096 bytes, for example 512
>> bytes, 4095 or 8191 bytes, the driver must first read the sector, modify
>> it and finally write it back to the disk. This explains the bi and the
>> increased number of interrupts.
>>
>> I did some Google searches but did not find much. Can someone confirm
>> this hypothesis?
>
> The read-modify-write performance penalty of unaligned partitions on the
> "Advanced Format" drives (4KB native sectors) is a separate unrelated issue.
> As I demonstrated earlier in this thread, the performance drop seen when
> using dd with block sizes less than 4KB affects traditional 512B/sector
> drives as well. If one has a misaligned partition on an Advanced Format
> drive, one takes a double performance hit when dd bs is less than 4KB.
>
> Again, everything in (x86) Linux is optimized around the 'magic' 4KB
> size, including page size, filesystem block size, and LVM block size.
>
> BTW, did you run your test with each of the elevators, as I recommended?
> Do the following, testing dd after each change.
dom0-2:~ # echo deadline > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq
dom0-2:~ # ./bw
Writing 1 GB
      bs     time      rate
 (bytes)      (s)   (MiB/s)
     512  54.0373   19.8704  1024
    1024  54.2937   19.7765  1024
    2048  52.1781   20.5784  1024
    4096  13.751    78.0846  1024
    8192  13.8519   77.5159  1024
dom0-2:~ # echo noop > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq
dom0-2:~ # ./bw
Writing 1 GB
      bs     time      rate
 (bytes)      (s)   (MiB/s)
     512  53.9634   19.8976  1024
    1024  52.0421   20.6322  1024
    2048  54.0437   19.868   1024
    4096  13.9612   76.9088  1024
    8192  13.8183   77.7043  1024
dom0-2:~ # echo cfq > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
noop deadline [cfq]
dom0-2:~ # ./bw
Writing 1 GB
      bs     time      rate
 (bytes)      (s)   (MiB/s)
     512  56.0087   19.171   1024
    1024  56.345    19.0565  1024
    2048  56.0436   19.159   1024
    4096  15.1232   70.9999  1024
    8192  15.4236   69.6168  1024
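The bw tool above is Dion's own benchmark and its source is not in the
thread. A rough shell stand-in (scratch file instead of the raw device,
8 MiB instead of 1 GB; both are my assumptions) that produces the same
shape of table:

```shell
# Time dd over a range of block sizes and print bs, elapsed time and
# throughput, similar to the bw output above.
target=$(mktemp)
total=$(( 8 * 1024 * 1024 ))    # 8 MiB per run; the thread used 1 GB

printf '%8s %10s %12s\n' 'bs' 'time(s)' 'rate(MiB/s)'
for bs in 512 1024 2048 4096 8192; do
    start=$(date +%s.%N)
    dd if=/dev/zero of="$target" bs="$bs" count=$(( total / bs )) \
       conv=fsync 2>/dev/null
    end=$(date +%s.%N)
    awk -v bs="$bs" -v a="$start" -v b="$end" -v bytes="$total" \
        'BEGIN { t = b - a; printf "%8d %10.4f %12.2f\n", bs, t, bytes/1048576/t }'
done
rm -f "$target"
```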
> Also, just for fun, and interesting results, increase your read_ahead_kb
> from the default 128 to 512.
>
> $ echo 512 > /sys/block/sdX/queue/read_ahead_kb
> $ echo deadline > /sys/block/sdX/queue/scheduler
> $ echo noop > /sys/block/sdX/queue/scheduler
> $ echo cfq > /sys/block/sdX/queue/scheduler
> These changes are volatile so a reboot clears them in the event you're
> unable to change them back to the defaults for any reason. This is
> easily avoidable if you simply cat the files and write down the values
> before changing them. After testing, echo the default values back in.