
ATA 4 KiB sector issues.


Tejun Heo

Mar 7, 2010, 10:50:02 PM
Hello, guys.

It looks like the transition to ATA 4k drives will be quite painful and
we aren't really ready although these drives are already selling widely.
I've written up a summary document on the issue to clarify stuff as
it's getting more and more confusing and develop some consensus. It's
also on the linux ata wiki.

http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues

I've cc'd people whom I can think of off the top of my head but I
surely have missed some people who would have been interested. Please
feel free to add cc's or forward the message to other MLs.
Especially, I don't know much about partitioners so the details there
are pretty shallow and could be plain wrong. It would be great if
someone who knows more about this stuff can chime in.

Thanks.

=== Document follows ===

ATA 4 KiB sector issues

Background
==========

Up until recently, all ATA hard drives have been organized in 512 byte
sectors. For example, my 500 GB (about 466 GiB) hard drive is organized
as 976773168 512 byte sectors numbered from 0 to 976773167. This is how
a drive communicates with the driver. When the operating system wants
to read 32 KiB of data at 1 MiB position, the driver asks the drive to
read 64 sectors from LBA (Logical block address, sector number) 2048.
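
The arithmetic in the example above can be sketched in a few lines of
Python. This is only an illustration; the function name
byte_range_to_lba is made up for this note and is not part of any
driver.

```python
SECTOR_SIZE = 512  # bytes per logical sector, as in the example above

KiB = 1024
MiB = 1024 * KiB


def byte_range_to_lba(offset_bytes, length_bytes, sector_size=SECTOR_SIZE):
    """Return (starting LBA, sector count) covering a byte range."""
    first = offset_bytes // sector_size
    last = (offset_bytes + length_bytes - 1) // sector_size
    return first, last - first + 1


# Reading 32 KiB at the 1 MiB position.
print(byte_range_to_lba(1 * MiB, 32 * KiB))  # (2048, 64)
```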

Because each sector should be addressable, readable and writable
individually, the physical medium also is organized in the same sized
sectors. In addition to the area to store the actual data, each
sector requires extra space for book keeping - inter-sector space to
enable locating and addressing each sector and ECC data to detect and
correct inevitable raw data errors.

As the densities and capacities of hard drives keep growing, stronger
ECC becomes necessary to guarantee acceptable level of data integrity
increasing the space overhead. In addition, in most applications,
hard drives are now accessed in units of at least 8 sectors or 4096
bytes and maintaining 512 byte granularity has become somewhat
meaningless.

This reached a point where enlarging the sector size to 4096 bytes
would yield measurably more usable space given the same raw data
storage size and hard drive manufacturers are transitioning to 4 KiB
sectors.

Anandtech has a good article which illustrates the background and
issues with pretty diagrams[1].


Physical vs. Logical
====================

Because the 512 byte sector size has been around for a very long time
and up to ATA/ATAPI-7 the sector size was fixed at 512 bytes, the
sector size assumption is scattered across all the layers -
controllers or bridge chips snooping commands, BIOSs, boot codes,
drivers, partitioners and system utilities, which makes it very
difficult to change the sector size from 512 byte without breaking
backward compatibility massively.

As a workaround, the concept of logical sector size was introduced.
The physical medium is organized in 4 KiB sectors but the firmware on
the drive will present it as if the drive is composed of 512 byte
sectors thus making the drive behave as before, so if the driver asks
the hard drive to read 64 sectors from LBA 2048, the firmware will
translate it and read 8 4 KiB sectors from hardware sector 256. As a
result, the hard drive now has two sector sizes - the physical one
which the physical media is actually organized in, and the logical one
which the firmware presents to the outside world.

A straightforward example mapping between physical sector and LBA
would be

LBA = 8 * phys_sect
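
As a rough sketch of the translation the firmware would have to do
under this straightforward mapping, assuming 4 KiB physical and 512
byte logical sectors (the helper name lba_to_phys is hypothetical):

```python
LOGICAL = 512                 # logical sector size in bytes
PHYSICAL = 4096               # physical sector size in bytes
RATIO = PHYSICAL // LOGICAL   # 8 logical sectors per physical sector


def lba_to_phys(lba, count):
    """Return (first physical sector, physical sector count) touched by
    a request for `count` logical sectors starting at `lba`, assuming
    LBA = 8 * phys_sect."""
    first = lba // RATIO
    last = (lba + count - 1) // RATIO
    return first, last - first + 1


# 64 logical sectors from LBA 2048 become 8 physical sectors from 256.
print(lba_to_phys(2048, 64))  # (256, 8)
```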


Alignment problem on 4 KiB physical / 512 logical drives
=======================================================

This workaround keeps older hardware and software working while
allowing the drive to use larger sector size internally. However, the
discrepancy between physical and logical sector sizes creates an
alignment issue. For example, if the driver wants to read 7 sectors
from LBA 2047, the firmware has to read hardware sectors 255 and 256
and trim the leading 7*512 bytes and the trailing 2*512 bytes.

For reads, this isn't an issue as drives read in larger chunks anyway,
but for writes, the drive has to do a read-modify-write to achieve the
requested action. It first has to read hardware sectors 255 and 256,
update the requested parts and then write those sectors back, which can
cause significant performance degradation[2].
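
To make the read-modify-write condition concrete, here is a small
sketch that reports which physical sectors a write touches and how
many bytes of them the write does not cover. It assumes the plain
LBA = 8 * phys_sect mapping; describe_write is an illustrative name,
not something the firmware or kernel exposes.

```python
LOGICAL = 512   # logical sector size in bytes
RATIO = 8       # logical sectors per 4 KiB physical sector


def describe_write(lba, count):
    """Describe how a write of `count` logical sectors at `lba` lands
    on physical sectors."""
    first_phys = lba // RATIO
    last_phys = (lba + count - 1) // RATIO
    # Bytes in the first physical sector that precede the write.
    lead = (lba - first_phys * RATIO) * LOGICAL
    # Bytes in the last physical sector that follow the write.
    trail = ((last_phys + 1) * RATIO - (lba + count)) * LOGICAL
    return {
        "phys_sectors": (first_phys, last_phys),
        "lead_bytes": lead,
        "trail_bytes": trail,
        # Any partially covered physical sector has to be read first.
        "needs_rmw": lead != 0 or trail != 0,
    }


# An aligned 4 KiB write covers exactly one physical sector.
print(describe_write(2048, 8))
# The same write shifted back by one logical sector straddles two.
print(describe_write(2047, 8))
```

The second call reports physical sectors 255 and 256 with 7*512
uncovered leading bytes and 512 uncovered trailing bytes, so both
sectors would need to be read before being written back.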

The problem is aggravated by the way DOS partitions[3] have been laid
out traditionally. For reasons dating back more than two decades,
they are laid out according to something called disk geometry, which
nowadays consists of arbitrary values with a number of restrictions
accumulated over the years for backward compatibility. The end result
is that until recently (most Linux variants and up to Windows XP) the
first partition ends up on sector 63 and later ones on cylinder
boundaries where each cylinder usually is composed of 255 * 63
sectors.

Most modern filesystems generate accesses that are 4 KiB aligned
relative to the start of the partition they are in. If a drive maps
4 KiB physical sectors to 512 byte logical sectors from LBA 0, the
filesystem in the first partition will always be misaligned and
filesystems in later partitions are likely to be misaligned too.
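
A quick way to see how often the traditional layout lands on a 4 KiB
boundary is to test the candidate start sectors directly. The sketch
below assumes the usual 255 * 63 geometry and a drive mapping physical
sectors from LBA 0; is_aligned is a name invented for this example.

```python
SECTORS_PER_CYLINDER = 255 * 63   # 16065 logical sectors
RATIO = 8                         # logical sectors per 4 KiB physical sector


def is_aligned(lba):
    """True if `lba` starts on a physical sector boundary when the
    drive maps physical sectors from LBA 0."""
    return lba % RATIO == 0


# First partition at LBA 63, later ones on cylinder boundaries.
starts = [63] + [n * SECTORS_PER_CYLINDER for n in range(1, 9)]
for lba in starts:
    print(lba, is_aligned(lba))
```

Because 16065 is odd, a cylinder boundary is 4 KiB aligned only when
the cylinder number is a multiple of 8, so only the last entry in the
list above prints True.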


Solving the alignment problem on 4 KiB physical / 512 logical drives
====================================================================

There are multiple ways which attempt to solve the problem.

S-1. Yet another workaround from the firmware - offset-by-one.

Yet another workaround which can be done by the firmware is to
offset physical to logical mapping by one logical sector such that
LBA 63 ends up on physical sector boundary, which aligns the first
partition to physical sectors without requiring any software update.
The example mapping between phys_sector and LBA becomes

LBA = 8 * phys_sect - 1

The leading 512 bytes of phys_sect 0 are not used and LBA 0 starts
right after that point. phys_sect 1 maps to LBA 7 and phys_sect 8 to
63, making LBA 63 aligned on a hardware sector.

Although this aligns only the first partition, for many use cases,
especially the ones involving older software, this workaround was
deemed useful and some recent drives with 4 KiB physical sectors are
equipped with a dip switch to turn on or off offset-by-one mapping.
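
The two example mappings can be compared side by side with a short
sketch. The function names are hypothetical, and the offset-by-one
variant simply encodes LBA = 8 * phys_sect - 1 from above.

```python
RATIO = 8  # logical sectors per 4 KiB physical sector


def first_lba_of_phys(phys_sect, offset_by_one=False):
    """First LBA stored in `phys_sect`.

    With offset-by-one the result for phys_sect 0 is -1, which stands
    for the unused leading 512 bytes; LBA 0 is the second 512 byte
    slot of that physical sector.
    """
    return RATIO * phys_sect - (1 if offset_by_one else 0)


def starts_phys_sector(lba, offset_by_one=False):
    """True if `lba` is the first logical sector of a physical sector."""
    return (lba + (1 if offset_by_one else 0)) % RATIO == 0


print(first_lba_of_phys(1, offset_by_one=True))    # 7
print(first_lba_of_phys(8, offset_by_one=True))    # 63
print(starts_phys_sector(63, offset_by_one=True))  # True
print(starts_phys_sector(63))                      # False
```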

S-2. The proper solution.

Correct alignments for all partitions can't be achieved by the
firmware alone. The system utilities should be informed about the
alignment requirements and align partitions accordingly.

The above firmware workaround complicates the situation because the
two different configurations require different offsets to achieve
the correct alignments. ATA/ATAPI-8 specifies a way for a drive to
export the physical and logical sector sizes and the LBA offset
which is aligned to the physical sectors.

In Linux, these parameters are exported via the following sysfs
nodes.

physical sector size : /sys/block/sdX/queue/physical_block_size
logical sector size : /sys/block/sdX/queue/logical_block_size
alignment offset : /sys/block/sdX/alignment_offset

Let the physical sector size be PSS, logical sector size LSS and
alignment offset AOFF. The system software should place partitions
such that the starting LBAs of all partitions are aligned on

(n * PSS + AOFF) / LSS

For 4 KiB physical sector offset-by-one drives, PSS is 4096, LSS 512
and AOFF 3584 and with n of 7 the above becomes,

(7 * 4096 + 3584) / 512 == 63

making sector 63 an aligned LBA where the first partition can be
put, but without the offset-by-one mapping, AOFF is zero and LBA 63
is not aligned.
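
A minimal sketch of how a partitioner might apply the formula above.
The sysfs paths are the ones listed earlier; the function names
(read_alignment, aligned_lba, is_aligned, next_aligned_lba) are
invented for this example and do not correspond to any existing tool.

```python
from pathlib import Path


def read_alignment(dev):
    """Read (PSS, LSS, AOFF) in bytes for a block device such as 'sda'
    from the sysfs nodes listed above. Only works on Linux."""
    base = Path("/sys/block") / dev

    def read_int(rel):
        return int((base / rel).read_text().strip())

    return (read_int("queue/physical_block_size"),
            read_int("queue/logical_block_size"),
            read_int("alignment_offset"))


def aligned_lba(n, pss, lss, aoff):
    """The n-th aligned LBA: (n * PSS + AOFF) / LSS."""
    return (n * pss + aoff) // lss


def is_aligned(lba, pss, lss, aoff):
    """True if `lba` starts on a physical sector boundary."""
    return (lba * lss - aoff) % pss == 0


def next_aligned_lba(lba, pss, lss, aoff):
    """Smallest aligned LBA that is >= `lba`."""
    byte = lba * lss
    rem = (byte - aoff) % pss
    if rem:
        byte += pss - rem
    return byte // lss


# Offset-by-one drive: PSS 4096, LSS 512, AOFF 3584.
print(aligned_lba(7, 4096, 512, 3584))        # 63
print(is_aligned(63, 4096, 512, 3584))        # True
# Same drive without offset-by-one: AOFF 0.
print(is_aligned(63, 4096, 512, 0))           # False
print(next_aligned_lba(60, 4096, 512, 0))     # 64
```

On a real system the three parameters would come from something like
read_alignment("sda") instead of being typed in by hand.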

With the above new alignment requirement in place, it becomes
difficult to honor the legacy one - first partition on sector 63 and
all other partitions on cylinder boundary (255 * 63 sectors) - as
the two alignment requirements contradict each other. This might be
worked around by adjusting how LBA and CHS addresses are mapped but
the disk geometry parameters are hard coded everywhere and there is
no reliable way to communicate custom geometry parameters.


Complications
=============

Unfortunately, there are complications.

C-1. The standard is not and won't be followed as-is.

Some of the existing BIOSs and/or drivers can't cope with drives
which report 4 KiB physical sector size. To work around this, some
drive models falsely report a physical sector size of 512 bytes when
the actual configuration is 4 KiB without offsetting.

This nullifies the provisions for alignment in the ATA standard but
results in the correct alignment for Windows Vista and 7. OS
behaviors will be described further later.

For these drives, which are likely to continue to be shipped for the
foreseeable future, traditional LBA 63 and cylinder based aligning
results in misalignment.

C-2. Windows XP depends on the traditional partition layout.

Windows XP makes use of the CHS start/end addresses in the partition
table and gets confused if partitions are not laid out
traditionally. This means that XP can't be installed into a
partition prepared by later versions of Windows[4]. This isn't a
big problem for Windows because in most cases the later version is
replacing the older one, not the other way around.

Unfortunately, the situation is more complex for Linux because Linux
is often co-installed with various versions of Windows and XP is
still quite popular. This means that when a Linux partitioner is
used to prepare a partition which may be used by Windows, the
partitioner might have to consider which version of Windows is going
to be used and whether to align the partitions for the correct
alignment or compatibility with older versions of Windows.

C-3. The 2 TiB barrier and the possibility for 4 KiB logical sector size.

The DOS partition format uses 32 bit for the starting LBA and the
number of sectors and, reportedly, 32 bit Windows XP shares the
limitation. With 32 bit addressing and 512 byte logical sector
size, the maximum addressable sector + 1 is at

2^32 * 2^9 == 2^41 == 2 TiB

The DOS partition format allows a partition to reach beyond 2 TiB as
long as the starting LBA is under 2 TiB; however, both Windows XP
and the Linux kernel (at least up to v2.6.33) refuse such
partition configurations.

With the right combination of host controller, BIOS and driver, this
barrier can be overcome by enlarging the logical sector size to 4
KiB, which will push the barrier out to 16 TiB. On the right
configuration, Windows XP is reportedly able to address beyond the 2
TiB barrier with a DOS partition and 4 KiB logical sector size.
Linux kernels up to v2.6.33 don't work under such configurations but
a patch to make it work is pending[5].
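
The two barriers follow directly from the 32 bit fields, as this short
calculation shows. The helper name mbr_limit_bytes is made up for the
example.

```python
TiB = 1 << 40


def mbr_limit_bytes(logical_sector_size, lba_bits=32):
    """Bytes addressable with `lba_bits` wide sector numbers."""
    return (1 << lba_bits) * logical_sector_size


print(mbr_limit_bytes(512) // TiB)    # 2
print(mbr_limit_bytes(4096) // TiB)   # 16
```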

This might also be beneficial for operating systems which don't
suffer from this limitation. A different partition format - GPT[6]
- should be used beyond 2^32 sectors, which could harm compatibility
with older BIOSs or other operating systems which don't recognize
the new format.

As mentioned previously, 512 byte sector assumption has been there
for a very long time and changing it is likely to cause various
compatibility problems at many different layers from hardware up to
the system utilities.


Windows
=======

As hard drive vendors aim for performance and compatibility in modern
Windows environments, it is worthwhile to investigate how Windows
partitions drives with different alignment requirements. Up until
Windows XP, it followed the traditional layout - the first partition on
LBA 63 and the others on cylinder boundaries where a cylinder is defined
as 255 tracks with 63 sectors each.

Windows Vista and 7 align partitions differently. As the two behave
similarly, only 7's behavior is shown here. These partition tables
are created by Windows 7 RC installer on blank disks.

W-1. 512 byte physical and logical sector drive.

ST FIRST T LAST LBA NBLKS
80 202100 07 df130c 00080000 00200300
00 df140c 07 feffff 00280300 00689e12
00 000000 00 000000 00000000 00000000
00 000000 00 000000 00000000 00000000

Part0: FIRST C 0 H 32 S 33 : 2048 (63 sec/trk)
LAST C 12 H 223 S 19 : 206847 (255 heads/cyl)
LBA 2048 + 204800 = 206848

Part1: FIRST C 12 H 223 S 20 : 206848
LAST C 1023 H 254 S 63 : E
LBA 206848 + 312371200 = 312578048

Both aligned at (2048 * n). Part 1 not aligned to cylinder.

W-2. 4 KiB physical and 512 byte logical sector drive without offset-by-one.

ST FIRST T LAST LBA NBLKS
80 202100 07 df130c 00080000 00200300
00 df140c 07 feffff 00280300 00b83f25
00 000000 00 000000 00000000 00000000
00 000000 00 000000 00000000 00000000

Part0: FIRST C 0 H 32 S 33 : 2048 (63 sec/trk)
LAST C 12 H 223 S 19 : 206847 (255 heads/cyl)
LBA 2048 + 204800 = 206848

Part1: FIRST C 12 H 223 S 20 : 206848
LAST C 1023 H 254 S 63 : E
LBA 206848 + 624932864 = 625139712

Both aligned at (2048 * n). Part 1 not aligned to cylinder.

W-3. 4 KiB physical and 512 byte logical sector drive with offset-by-one.

ST FIRST T LAST LBA NBLKS
80 202800 07 df130c 07080000 f91f0300
00 df1b0c 07 feffff 07280300 f9376d74
00 000000 00 000000 00000000 00000000
00 000000 00 000000 00000000 00000000

Part0: FIRST C 0 H 32 S 40 : 2055 (63 sec/trk)
LAST C 12 H 223 S 19 : 206847 (255 heads/cyl)
LBA 2055 + 204793 = 206848

Part1: FIRST C 12 H 223 S 27 : 206855
LAST C 1023 H 254 S 63 : E
LBA 206855 + 1953314809 = 1953521664

Both aligned at (2048 * n + 7). Part 1 not aligned to cylinder.
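
For anyone who wants to re-check the decoding, the raw entries above
can be parsed with a few lines of Python. This is a sketch assuming
the standard 16 byte DOS partition entry layout (status, first CHS,
type, last CHS, little-endian start LBA and sector count) and the
255 heads / 63 sectors geometry used in the tables; the function names
are invented for the example.

```python
import struct

HEADS = 255
SECTORS_PER_TRACK = 63


def decode_chs(b):
    """Decode a packed 3 byte CHS address into (cylinder, head, sector)."""
    head = b[0]
    sector = b[1] & 0x3F
    cylinder = ((b[1] & 0xC0) << 2) | b[2]
    return cylinder, head, sector


def chs_to_lba(c, h, s):
    """Convert CHS to LBA using the 255 * 63 geometry."""
    return (c * HEADS + h) * SECTORS_PER_TRACK + (s - 1)


def decode_entry(hex_entry):
    """Decode one partition entry given as hex, spaces allowed."""
    raw = bytes.fromhex(hex_entry.replace(" ", ""))
    status, first, ptype, last, lba, nblks = struct.unpack("<B3sB3sII", raw)
    return {
        "status": status,
        "type": ptype,
        "first_chs": decode_chs(first),
        "last_chs": decode_chs(last),
        "lba": lba,
        "nblks": nblks,
    }


# First entry of W-3 (offset-by-one drive).
part0 = decode_entry("80 202800 07 df130c 07080000 f91f0300")
print(part0["first_chs"], part0["lba"], part0["nblks"])
# (0, 32, 40) 2055 204793
print(chs_to_lba(*part0["first_chs"]))   # 2055
print((part0["lba"] - 7) % 2048)         # 0, i.e. 2048 * n + 7
```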

The partitioner seems to be using 1 MiB as the basic alignment unit and
offsetting from there if explicitly requested by the drive. There is no
difference between the handling of 512 byte and 4 KiB drives, which
explains why C-1 works for hard drive vendors.
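
The rule inferred above can be written down as a one-line formula.
Note that this is only a guess at the behavior based on the three
partition tables, not documented Windows behavior, and the function
name is hypothetical.

```python
MiB = 1024 * 1024


def windows_style_start(n, lss=512, aoff=0):
    """Start LBA of a partition placed at the n-th 1 MiB boundary,
    shifted by the alignment offset the drive reports (in bytes)."""
    return (n * MiB + aoff) // lss


# W-1 / W-2: no alignment offset reported.
print(windows_style_start(1), windows_style_start(101))
# 2048 206848

# W-3: offset-by-one drive, alignment offset 3584 bytes (7 sectors).
print(windows_style_start(1, aoff=3584), windows_style_start(101, aoff=3584))
# 2055 206855
```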

In all cases, the partitioner ignores both the first partition on LBA
63 and the others on cylinder boundary requirements while still using
the same 255*63 cylinder size. Also, note that in W-3, both part 0
and 1 end up with an odd number of sectors. It seems that they simply
decided to completely break away from the traditional layout, which is
understandable given that there really isn't one good solution which
can cover all the cases and that the default larger alignment benefits
earlier SSDs.

Windows Vista basically shows the same behavior. Vista was tested by
creating two partitions using the management tool. Test data is
available at [7].

*-alignment_offset : alignment_offset reported by Linux kernel
*-fdisk : fdisk -l output
*-fdisk-u : fdisk -lu output
*-hdparm : hdparm -I output
*-mbr : dump of mbr
*-part : decoded partition table from mbr

Please note that hdparm is misreporting the alignment offset. It
should be reporting 512 instead of 256 for offset-by-one drives.


So, what now for Linux?
=======================

The situation is not easy. Considering all the factors, the only
workable solution looks like doing what Windows is doing. Hard drive
and SSD vendors are focusing on compatibility and performance on
recent Windows releases and are happy to do things which break the
standard-defined mechanism as shown by C-1, so departing from what
Windows does would be unnecessarily painful.

Unfortunately, while Windows can assume that newer releases won't
share the hard drive with older releases including Windows XP, Linux
distros can't do that. There will be many installations where a
modern Linux distro shares a hard drive with older releases of
Windows. At this point, I can't see a silver bullet solution.

Perhaps partitioners should only align partitions which will be used by
Linux and default to the traditional layout for others while allowing
explicit override. I think Windows XP wouldn't have a problem with
differently aligned partitions as long as it doesn't actually use them
but I haven't tested it.

Reportedly, commonly used partitioners aren't ready to handle drives
larger than 2 TiB in any configuration and alignment isn't done
properly for drives with 4 KiB physical sectors. 4 KiB logical sector
support is broken in both the kernel and partitioners. (need more
details and probably a whole section on partitioner behaviors)

Unfortunately, the transition to 4 KiB sector size, physical only or
logical too, is looking fairly ugly. Hopefully, a reasonable solution
can be reached in the not too distant future but even with all the
software side updated, it looks like it's gonna cause a significant
amount of confusion and frustration.


[1] http://www.anandtech.com/storage/showdoc.aspx?i=3691
[2] http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives
[3] http://en.wikipedia.org/wiki/Master_boot_record
[4] http://support.microsoft.com/kb/931760
[5] http://thread.gmane.org/gmane.linux.kernel/953981
[6] http://en.wikipedia.org/wiki/GUID_Partition_Table
[7] http://userweb.kernel.org/~tj/partalign/

* Mar 04 2009
Initial draft, Tejun Heo <t...@kernel.org>
* Mar 08 2009
Updated according to comments from Daniel Taylor
<Daniel...@wdc.com>. Other minor updates.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Greg Freemyer

Mar 8, 2010, 12:40:01 AM
cc'ing Martin Petersen since I believe he is one of the most
knowledgeable kernel hackers on this topic and has been working the
issue for the last year.

--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
Preservation and Forensic processing of Exchange Repositories White Paper -
<http://www.norcrossgroup.com/forms/whitepapers/tng_whitepaper_fpe.html>

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

James Bottomley

Mar 8, 2010, 2:10:02 AM
Just a quick note:

The 2TB size for msdos partitions is a problem independent of the 4k
sector issue. Traditional 512 byte sector drives are now available in
those sizes. It looks like we're going to have to move to a new
partitioning label to solve this.

There's actually another barrier at 8 or 16TB, which is where a 4k
logical sector filesystem tops out using 32 bit block offsets (it's 8TB
if the fs hasn't been proof checked against sign extension problems).

However, for 4k sectors, the main issues which have shown up in testing
by others (mostly Martin) are

1. In native 4k mode, we work perfectly fine. *however*, most
BIOSs can't boot native 4k drives.
2. Even if the BIOS can boot native 4k, our own boot loaders seem
to be hard coded for 512 byte sectors in several places.
3. If we run in the 512 byte sector emulation mode, we end up with
the partition alignment problems you allude to.
4. The alignment problem is made more complex by drives that make
use of the offset exponent feature (what you refer to as offset
by one) ... fortunately very few of these have been seen in the
wild and we're hopeful they can be shot before they breed.
5. I'm really, really sorry to have to mention it, but it looks
like uefi is going to be the only way we can boot non-msdos
partitioned devices with native 4k sectors.

so the bottom line seems to be that if you want the device as a non boot
disk, use native 4k sectors and a non-msdos partition label. If you
want to boot from the drive and your bios won't boot 4k natively,
partition everything using the 512 emulation and try to align the
partitions correctly. If your bios/uefi will boot 4k natively, just use
it and whatever partition label the bios/uefi supports.

Martin can fill in the pieces I've left out.

James

H. Peter Anvin

Mar 8, 2010, 3:00:03 AM
On 03/07/2010 11:00 PM, James Bottomley wrote:
>
> The 2TB size for msdos partitions is a problem independent of the 4k
> sector issue. Traditional 512 byte sector drives are now available in
> those sizes. It looks like we're going to have to move to a new
> partitioning label to solve this.
>
> There's actually another barrier at 8 or 16TB, which is where a 4k
> logical sector filesystem tops out using 32 bit block offsets (it's 8TB
> if the fs hasn't been proof checked against sign extension problems).
>

The limit for the MS-DOS partition tables is 2^32 sectors. The patch
that Daniel posted was for a Linux kernel internal limit that set the
limit to 2 TB.

-hpa

H. Peter Anvin

Mar 8, 2010, 3:00:02 AM
On 03/07/2010 11:00 PM, James Bottomley wrote:

I would very much like a reference for a platform which has firmware
which can successfully boot from 4K-logical media. It would be very
useful for bootloader testing.

Aligning partitions is something we should have done long ago. It
affects RAID and many flash drives just as much or more than 4K-sectored
disks.

Legacy BIOS doesn't care at all how the disk is partitioned, so as long
as the BIOS can read the disk at all the rest is up to the bootloader.
Of course, since there hasn't been the opportunity to test, bootloaders
generally don't handle it correctly (early versions of Syslinux
supported any sector size, but that bitrotted, and for the lack of
testing I eventually ended up hard-coding the number. Now I'd like to
get it working properly.)

As far as partitioning... I believe we should be using GPT partition
tables where possible. Even on non-EFI systems, it's simply a much
better partition table format.

-hpa

Martin K. Petersen

Mar 8, 2010, 10:30:02 AM
>>>>> "Tejun" == Tejun Heo <t...@kernel.org> writes:

Tejun> The [Windows Vista/7] partitioner seems to be using 1M as the
Tejun> basic alignment unit and offsetting from there if explicitly
Tejun> requested by the drive

Yep.


Tejun> Please note that hdparm is misreporting the alignment offset. It
Tejun> should be reporting 512 instead of 256 for offset-by-one drives.

Already fixed. Your hdparm must be old.

Tejun> Partitioners maybe should only align partitions which will be
Tejun> used by Linux and default to the traditional layout for others
Tejun> while allowing explicit override.

I don't think we take the partition type into account. Karel?


Tejun> Reportedly, commonly used partitioners aren't ready to handle
Tejun> drives larger than 2 TiB in any configuration and alignment isn't
Tejun> done properly for drives with 4 KiB physical sectors. 4 KiB
Tejun> logical sector support is broken in both the kernel

Huh, what? My homedir is on a 4KiB LBS/PBS drive and has been for ~2
years.


Tejun> (need more details and probably a whole section on partitioner
Tejun> behaviors)

I'm Cc:'ing Karel Zak and Jim Meyering who have been doing all the
alignment work for fdisk and parted respectively. Karel, Jim: The full
writeup is here:

http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues

It'd be great if you guys could share what you have been doing to the
tooling.


Tejun> Unfortunately, the transition to 4 KiB sector size, physical only
Tejun> or logical too, is looking fairly ugly. Hopefully, a reasonable
Tejun> solution can be reached in not too distant future but even with
Tejun> all the software side updated, it looks like it's gonna cause
Tejun> significant amount of confusion and frustration.

With regards to XP compatibility I don't think we should go too much out
of our way to accommodate it. XP has been disowned by its master and I
think virtualization will take care of the rest.

FWIW, recent fdisk has a command line flag that will enable/disable DOS
compatible layout.

--
Martin K. Petersen Oracle Linux Engineering

Martin K. Petersen

Mar 8, 2010, 10:40:02 AM
>>>>> "hpa" == H Peter Anvin <h...@zytor.com> writes:

hpa> I would very much like a reference for a platform which has
hpa> firmware which can successfully boot from 4K-logical media. It
hpa> would be very useful for bootloader testing.

I have yet to find one.


hpa> Aligning partitions is something we should have done long ago. It
hpa> affects RAID and many flash drives just as much or more than
hpa> 4K-sectored disks.

Yup.


hpa> As far as partitioning... I believe we should be using GPT
hpa> partition tables where possible. Even on non-EFI systems, it's
hpa> simply a much better partition table format.

Agreed.

--
Martin K. Petersen Oracle Linux Engineering

Martin K. Petersen

Mar 8, 2010, 10:40:01 AM
>>>>> "James" == James Bottomley <James.B...@suse.de> writes:

James> However, for 4k sectors, the main issues which have shown up in
James> testing by others (mostly Martin) are

James> 1. In native 4k mode, we work perfectly fine. *however*,
James> most BIOSs can't boot native 4k drives.

Correct. I have engaged with pretty much all the big OEMs in the
industry and so far the interest has been near zero.


James> 4. The aligment problem is made more complex by drives that
James> make use of the offset exponent feature (what you refer
James> to as offset by one) ... fortunately very few of these
James> have been seen in the wild and we're hopeful they can be
James> shot before they breed.

This topic is constantly up for debate in IDEMA. However, it looks like
we might win because of the impending demise of XP.


James> so the bottom line seems to be that if you want the device as a
James> non boot disk, use native 4k sectors and a non-msdos partition
James> label. If you want to boot from the drive and your bios won't
James> book 4k natively, partition everything using the 512 emulation
James> and try to align the partitions correctly. If your bios/uefi
James> will boot 4k natively, just use it and whatever partition label
James> the bios/uefi supports.

James> Martin can fill in the pieces I've left out.

Here's my latest take given what I hear on the grapevine:

1. 512-byte logical block size drives will be around forever for legacy
deployments because nobody is willing to do the required BIOS int13
work. It's not just a BIOS thing, this requires heavy changes to HBA
boot ROMs as well.

2. Some vendors are working on EFI firmware and will support booting off
of 4KB LBS drives there. This is mostly aimed at the server space.

3. 4 KB logical block size drives will mainly be targeted for use inside
arrays. Off the shelf enterprise drive models will most likely
continue to ship with a 512-byte LBS.

4. Part of the hesitation to work on booting off of 4 KB lbs drives is
motivated by a general trend in the industry to move boot
functionality to SSD. There are 4 KB LBS SSDs out there but in
general the industry is sticking to ATA for local boot.

--
Martin K. Petersen Oracle Linux Engineering

Martin K. Petersen

Mar 8, 2010, 10:50:01 AM
>>>>> "Martin" == Martin K Petersen <martin....@oracle.com> writes:

Martin> There are 4 KB LBS SSDs out there but in general the industry is
Martin> sticking to ATA for local boot.

Thus implying that ATA doesn't support 4 KB LBS, just that people stick
to the tried-and-true 512.

Martin K. Petersen

Mar 8, 2010, 10:50:01 AM
>>>>> "Martin" == Martin K Petersen <martin....@oracle.com> writes:

>>>>> "Martin" == Martin K Petersen <martin....@oracle.com> writes:
Martin> There are 4 KB LBS SSDs out there but in general the industry is
Martin> sticking to ATA for local boot.

Martin> Thus implying that ATA doesn't support 4 KB LBS, just that
Martin> people stick to the tried-and-true 512.

*sigh* I haven't had my breakfast tea yet...

What I meant to say was that I know ATA supports 4 KB LBS and that
nobody appears to care about it.

H. Peter Anvin

Mar 8, 2010, 1:40:01 PM
On 03/08/2010 07:18 AM, Martin K. Petersen wrote:
>
> Tejun> Partitioners maybe should only align partitions which will be
> Tejun> used by Linux and default to the traditional layout for others
> Tejun> while allowing explicit override.
>
> I don't think we take the partition type into account. Karel?
>

We should not take the partition type into account. The other aspect is
that FAT partitions need to be formatted differently to maintain the
alignment once set; I have recently contributed patches (which were
accepted) into mkdosfs to do the right thing there.

Looking at the Windows XP article, it looks like it is limited to
certain BIOSes; unfortunately it doesn't say what the particular BIOS
issue is. If we can find a system which actually exhibits the bug it
might be possible to reverse-engineer a solution.

> Tejun> Reportedly, commonly used partitioners aren't ready to handle
> Tejun> drives larger than 2 TiB in any configuration and alignment isn't
> Tejun> done properly for drives with 4 KiB physical sectors. 4 KiB
> Tejun> logical sector support is broken in both the kernel
>
> Huh, what? My homedir is on a 4KiB LBS/PBS drive and has been for ~2
> years.

For > 2 TiB drives with 4 KiB logical sectors and MS-DOS partition
tables, it is.

> Tejun> Unfortunately, the transition to 4 KiB sector size, physical only
> Tejun> or logical too, is looking fairly ugly. Hopefully, a reasonable
> Tejun> solution can be reached in not too distant future but even with
> Tejun> all the software side updated, it looks like it's gonna cause
> Tejun> significant amount of confusion and frustration.
>
> With regards to XP compatibility I don't think we should go too much out
> of our way to accommodate it. XP has been disowned by its master and I
> think virtualization will take care of the rest.

I think that is wildly optimistic, but I do observe there is a fix
from Microsoft in the article you reference.

> FWIW, recent fdisk has a command line flag that will enable/disable DOS
> compatible layout.

Yes, unfortunately it is still on by default.

-hpa

H. Peter Anvin

Mar 8, 2010, 2:00:02 PM
On 03/08/2010 07:41 AM, Martin K. Petersen wrote:
>>>>>> "Martin" == Martin K Petersen <martin....@oracle.com> writes:
>
>>>>>> "Martin" == Martin K Petersen <martin....@oracle.com> writes:
> Martin> There are 4 KB LBS SSDs out there but in general the industry is
> Martin> sticking to ATA for local boot.
>
> Martin> Thus implying that ATA doesn't support 4 KB LBS, just that
> Martin> people stick to the tried-and-true 512.
>
> *sigh* I haven't had my breakfast tea yet...
>
> What I meant to say was that I know ATA supports 4 KB LBS and that
> nobody appears to care about it.
>

Well, apparently Western Digital are looking at it for USB drives due to
XP compatibility requirements -- those presumably are ATA internally and
use a USB-ATA bridge.

On the flipside, though, there really is very little net benefit to 4K
as opposed to 512 byte logical sectors: the additional protocol overhead
is relatively minimal, and as long as writes are aligned full blocks,
there shouldn't be any additional overhead on either the OS or the drive
side. On the plus side, you get full compatibility with the existing
software stack. The equation really seems rather simple.

-hpa

James Bottomley

Mar 8, 2010, 2:00:02 PM
On Mon, 2010-03-08 at 10:50 -0800, H. Peter Anvin wrote:
> On 03/08/2010 07:41 AM, Martin K. Petersen wrote:
> >>>>>> "Martin" == Martin K Petersen <martin....@oracle.com> writes:
> >
> >>>>>> "Martin" == Martin K Petersen <martin....@oracle.com> writes:
> > Martin> There are 4 KB LBS SSDs out there but in general the industry is
> > Martin> sticking to ATA for local boot.
> >
> > Martin> Thus implying that ATA doesn't support 4 KB LBS, just that
> > Martin> people stick to the tried-and-true 512.
> >
> > *sigh* I haven't had my breakfast tea yet...
> >
> > What I meant to say was that I know ATA supports 4 KB LBS and that
> > nobody appears to care about it.
> >
>
> Well, apparently Western Digital are looking at it for USB drives due to
> XP compatibility requirements -- those presumably are ATA internally and
> use a USB-ATA bridge.
>
> On the flipside, though, there really is very little net benefit to 4K
> as opposed to 512 byte logical sectors: the additional protocol overhead
> is relatively minimal, and as long as writes are aligned full blocks,
> there shouldn't be any additional overhead on either the OS or the drive
> side. On the plus side, you get full compatibility with the existing
> software stack. The equation really seems rather simple.

There's another problem that afflicts 4k drives emulating 512b: they
have to do a read modify write for any isolated 512b write ... that
leads to potential corruption of adjacent 512b blocks if power is lost
at the moment the write is being done. Since most Linux filesystems are
4k sectors, misalignment really hammers this, plus most journal writes
seem to be done in 512 byte increments. I suppose for USB this could be
regarded as flakey as usual, though.
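
The read-modify-write condition can be made concrete with a little arithmetic. The Python sketch below is illustrative only: it assumes 512-byte logical sectors on 4 KiB physical sectors with no alignment offset, and the name `needs_rmw` is made up for this example rather than taken from any driver or tool.

```python
LOGICAL = 512      # logical sector size in bytes (assumed)
PHYSICAL = 4096    # physical sector size in bytes (assumed)

def needs_rmw(lba, count, logical=LOGICAL, physical=PHYSICAL):
    """Return True if writing `count` logical sectors starting at `lba`
    only partly covers some physical sector, so the drive would have to
    read the rest of that sector, merge, and write it back."""
    start = lba * logical            # first byte written
    end = start + count * logical    # one past the last byte written
    return start % physical != 0 or end % physical != 0

print(needs_rmw(2048, 1))  # isolated 512b write inside a 4k sector
print(needs_rmw(2048, 8))  # full, aligned 4k write
print(needs_rmw(63, 8))    # 4k write starting at the legacy LBA 63
```

Only the second call avoids read-modify-write; the other two leave part of a physical sector untouched by the request, which is the window where adjacent 512b blocks are exposed on power loss.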

James

H. Peter Anvin

Mar 8, 2010, 2:20:02 PM
On 03/08/2010 10:58 AM, James Bottomley wrote:
>>
>> On the flipside, though, there really is very little net benefit to 4K
>> as opposed to 512 byte logical sectors: the additional protocol overhead
>> is relatively minimal, and as long as writes are aligned full blocks,
>> there shouldn't be any additional overhead on either the OS or the drive
>> side. On the plus side, you get full compatibility with the existing
>> software stack. The equation really seems rather simple.
>
> There's another problem that afflicts 4k drives emulating 512b: they
> have to do a read modify write for any isolated 512b write ... that
> leads to potential corruption of adjacent 512b blocks if power is lost
> at the moment the write is being done. Since most Linux filesystems are
> 4k sectors, misalignment really hammers this, plus most journal writes
> seem to be done in 512 byte increments. I suppose for USB this could be
> regarded as flakey as usual, though.
>

Misalignment sucks in general. This is nothing new - the RAID and flash
people have had these problems for a long time now. It's clear we need
to align our filesystems, period.

As to the read-modify-write issue: to some degree there is very little
you can do about it other than a big enough capacitor. If you can't
write a sector atomically and have it stick, you're screwed no matter what.

-hpa

Mike Snitzer

Mar 8, 2010, 2:40:02 PM

I've been keeping track of all the pieces in play, have coordinated
with kzak and jim, and have a summary that offers some amount of macro
detail (at the end I touch on parted and fdisk):

http://people.redhat.com/msnitzer/docs/io-limits.txt

Karel Zak

Mar 8, 2010, 3:00:03 PM
On Mon, Mar 08, 2010 at 10:18:27AM -0500, Martin K. Petersen wrote:
> >>>>> "Tejun" == Tejun Heo <t...@kernel.org> writes:
> Tejun> Partitioners maybe should only align partitions which will be
> Tejun> used by Linux and default to the traditional layout for others
> Tejun> while allowing explicit override.
>
> I don't think we take the partition type into account. Karel?

Yes, you're right.

(IMHO our goal should be to minimize the number of places where anything
depends on partition type.)

> Tejun> Reportedly, commonly used partitioners aren't ready to handle
> Tejun> drives larger than 2 TiB in any configuration and alignment isn't

The limit is specific to the DOS partition table (with 512-byte logical
sectors); GPT, for example, uses 64-bit LBAs. I believe that our
partitioning tools don't introduce any other restriction.
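
The 2 TiB figure follows directly from the 32-bit sector fields of the DOS partition table. A small Python sketch of the arithmetic (the helper name `max_addressable_bytes` is hypothetical, used only for illustration):

```python
TIB = 1 << 40
ZIB = 1 << 70

def max_addressable_bytes(lba_bits, sector_size):
    """Bytes reachable with an LBA field of `lba_bits` bits."""
    return (1 << lba_bits) * sector_size

# DOS partition table: 32-bit start/length fields
print(max_addressable_bytes(32, 512) // TIB)    # 512-byte logical sectors
print(max_addressable_bytes(32, 4096) // TIB)   # 4 KiB logical sectors
# GPT: 64-bit LBAs
print(max_addressable_bytes(64, 512) // ZIB)    # 512-byte logical sectors
```

This prints 2, 16 and 8: a DOS table tops out at 2 TiB with 512-byte logical sectors and at 16 TiB with 4 KiB logical sectors, while GPT with 512-byte sectors reaches 8 ZiB.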

> Tejun> done properly for drives with 4 KiB physical sectors. 4 KiB
> Tejun> logical sector support is broken in both the kernel
>
> Huh, what? My homedir is on a 4KiB LBS/PBS drive and has been for ~2
> years.
>
>
> Tejun> (need more details and probably a whole section on partitioner
> Tejun> behaviors)
>
> I'm Cc:'ing Karel Zak and Jim Meyering who have been doing all the
> alignment work for fdisk and parted respectively. Karel, Jim: The full
> writeup is here:
>
> http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
>
> It'd be great if you guys could share what you have been doing to the
> tooling.

small summary:

- libblkid provides unified API to topology information, it supports:
    - ioctls (kernel >= 2.6.32)
    - sysfs (kernel >= 2.6.31)
    - stripe chunk size and stripe width for DM, MD, LVM and evms on
      old kernels
- libparted and fdisk are linked against libblkid

- fdisk supports 4KiB logical sector size (util-linux-ng >= 2.15)
- fdisk supports 4KiB physical sector size (util-linux-ng >= 2.17)
- fdisk uses 1MiB alignment (or more if optimal I/O size is bigger)
  and alignment_offset for all partitions in non-DOS mode
  (util-linux-ng >= 2.17.1)

- parted supports 4KiB physical sector size
- parted uses 1MiB alignment for disks with unknown topology, disks
  with topology information are aligned to optimal (or minimum) I/O
  size (parted >= 2.1)

- EFI GPT code in the kernel has been updated to work properly with
  4KiB sectors (kernel >= 2.6.33)

- mkfs.{ext,xfs,gfs2,ocfs2} have been updated to work properly with
  topology information, mkfs.{ext,xfs} are linked against libblkid
  for compatibility with old kernel (for stripe chunk size / width)

- Fedora-13/RHEL6 installer uses libparted with 4KiB support

- alignment_offset & 4KiB support is planned for LUKS (cryptsetup)
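
For anyone who wants to look at the topology data directly, here is a minimal Python sketch that reads the sysfs attributes the kernel exports for a block device. It does not use libblkid; the function name `read_topology` and the `sysfs_root` parameter are inventions for this example, and it assumes a kernel new enough to export these files.

```python
from pathlib import Path

def read_topology(dev, sysfs_root="/sys/block"):
    """Read I/O topology values (in bytes) for block device `dev`,
    e.g. "sda", from sysfs. `sysfs_root` is overridable for testing."""
    base = Path(sysfs_root) / dev

    def read_int(rel):
        # Each attribute is a single decimal number followed by a newline.
        return int((base / rel).read_text().strip())

    return {
        "logical_block_size": read_int("queue/logical_block_size"),
        "physical_block_size": read_int("queue/physical_block_size"),
        "minimum_io_size": read_int("queue/minimum_io_size"),
        "optimal_io_size": read_int("queue/optimal_io_size"),
        "alignment_offset": read_int("alignment_offset"),
    }
```

Calling `read_topology("sda")` on a 512-byte logical / 4 KiB physical drive would be expected to show `physical_block_size` of 4096 and `optimal_io_size` of 0, though what a given drive actually reports depends on its firmware.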

> Tejun> Unfortunately, the transition to 4 KiB sector size, physical only
> Tejun> or logical too, is looking fairly ugly. Hopefully, a reasonable
> Tejun> solution can be reached in not too distant future but even with
> Tejun> all the software side updated, it looks like it's gonna cause
> Tejun> significant amount of confusion and frustration.
>
> With regards to XP compatibility I don't think we should go too much out
> of our way to accommodate it. XP has been disowned by its master and I
> think virtualization will take care of the rest.
>
> FWIW, recent fdisk has a command line flag that will enable/disable DOS
> compatible layout.

yes, util-linux-ng 2.17.1, fdisk -c

Note that non-DOS mode will be default in the next major
util-linux-ng release.

Karel

--
Karel Zak <kz...@redhat.com>

Martin K. Petersen

Mar 8, 2010, 3:10:02 PM
>>>>> "hpa" == H Peter Anvin <h...@zytor.com> writes:

>> Huh, what? My homedir is on a 4KiB LBS/PBS drive and has been for ~2
>> years.

hpa> For > 2 TiB drives with 4 KiB logical sectors and MS-DOS partition
hpa> tables, it is.

Ah, that. Already fixed, I believe.


>> With regards to XP compatibility I don't think we should go too much
>> out of our way to accommodate it. XP has been disowned by its master
>> and I think virtualization will take care of the rest.

hpa> I think that's is wildly optimistic,

I don't expect XP to go away any time soon. But I do think that the
number of fresh XP installs in combination with Linux will be fairly
limited. And the general lack of hardware enablement will eventually kill
off XP on raw metal.

I think it's ok that we have stop-gap solutions in place for
interoperability. But I wouldn't want to waste all our resources on
designing for the past. I'm much more interested in making sure that
single-boot Linux is doing the right thing.


>> FWIW, recent fdisk has a command line flag that will enable/disable
>> DOS compatible layout.

hpa> Yes, unfortunately it is still on by default.

I agree that this is a don't-be-broken option and I would prefer it the
other way around (I know that's the plan for the next release. I just
hope the distributions get things right).

--
Martin K. Petersen Oracle Linux Engineering

Cláudio Martins

Mar 8, 2010, 3:20:02 PM

On Tue, 09 Mar 2010 00:28:25 +0530 James Bottomley <James.B...@suse.de> wrote:
>
> There's another problem that afflicts 4k drives emulating 512b: they
> have to do a read modify write for any isolated 512b write ... that
> leads to potential corruption of adjacent 512b blocks if power is lost
> at the moment the write is being done. Since most Linux filesystems are
> 4k sectors, misalignment really hammers this, plus most journal writes
> seem to be done in 512 byte increments. I suppose for USB this could be
> regarded as flakey as usual, though.
>

Most users assume that a single 512B sector write is atomic as far as
power failure is concerned. Hasn't this requirement been carried over
to the new 4k physical sector?

It seems reasonable that if a 512B sector write is atomic in the older
drives, a 4k sector write would also be atomic on the newer drives,
since the time required to write it is negligible when compared to
capacitor voltage decay and inertia of the disk platters.

Anyway, I suppose most of the energy/time required for a sector write
operation is being expended on head assembly positioning and the wait
for the correct sector to pass under the write head. That is, the write
operation itself takes so little time that it should make no difference
whether you write 512B or 4k.

So the question is: what are hard drive makers guaranteeing (if
anything at all)? Was a 512B sector write really atomic? Is a 4k one?
Or was it completely manufacturer-dependent to start?

Regards

Cláudio

H. Peter Anvin

Mar 8, 2010, 3:20:02 PM
On 03/08/2010 07:18 AM, Martin K. Petersen wrote:
>
> I'm Cc:'ing Karel Zak and Jim Meyering who have been doing all the
> alignment work for fdisk and parted respectively. Karel, Jim: The full
> writeup is here:
>
> http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
>
> It'd be great if you guys could share what you have been doing to the
> tooling.
>

Please correct the following bit in C-3:

"A different partition format - GPT[6] - should be used beyond 2^32
sectors, which could harm compatibility with older BIOSs or other
operating systems which don't recognize the new format."

BIOS does not care about the partition table format. There might be
issues with > 2^32 sectors for BIOSes (e.g. truncating sector counts),
but that would be unrelated.

-hpa

Martin K. Petersen

Mar 8, 2010, 3:30:01 PM
>>>>> "hpa" == H Peter Anvin <h...@zytor.com> writes:

hpa> On the flipside, though, there really is very little net benefit to
hpa> 4K as opposed to 512 byte logical sectors: the additional protocol
hpa> overhead is relatively minimal, and as long as writes are aligned
hpa> full blocks, there shouldn't be any additional overhead on either
hpa> the OS or the drive side. On the plus side, you get full
hpa> compatibility with the existing software stack. The equation
hpa> really seems rather simple.

4KB sectors are not a win for anybody except the drive vendors.

There is a push in the industry right now to keep the 512-byte logical
blocks forever. The first step would be to report misaligned accesses
or accesses that are not a multiple of the physical block size. Second
step would be to eventually reject any write that's not a properly
aligned multiple of the physical block size.

--
Martin K. Petersen Oracle Linux Engineering

Martin K. Petersen

Mar 8, 2010, 4:10:02 PM
>>>>> "Cláudio" == Cláudio Martins <ct...@ist.utl.pt> writes:

Cláudio> So the question is: what are hard drive makers guaranteeing (if
Cláudio> anything at all)?

No guarantees. Nothing that you can get in writing, anyway.


Cláudio> Was a 512B sector write really atomic?

Sometimes.


Cláudio> Is a 4k one?

Sometimes, maybe.

The problem with 4KB physical blocks is that if you do a partial or
misaligned write you'll end up having to do read-modify-write. And that
introduces a scenario where a subsequent write error will affect
logical blocks that were not part of the I/O request.

However, you also have that with regular drives because they often write
more than the actual block undergoing I/O. For instance to reduce
hotspot bleed to adjacent sectors.

There have been several unsuccessful attempts at nudging the drive
vendors into giving us real guarantees (supercapacitors, NVRAM or
flash-backed write cache). No luck so far. So people that care use
arrays with non-volatile caches.

--
Martin K. Petersen Oracle Linux Engineering

H. Peter Anvin

Mar 8, 2010, 4:20:02 PM
On 03/08/2010 12:19 PM, Martin K. Petersen wrote:
>>>>>> "hpa" == H Peter Anvin <h...@zytor.com> writes:
>
> hpa> On the flipside, though, there really is very little net benefit to
> hpa> 4K as opposed to 512 byte logical sectors: the additional protocol
> hpa> overhead is relatively minimal, and as long as writes are aligned
> hpa> full blocks, there shouldn't be any additional overhead on either
> hpa> the OS or the drive side. On the plus side, you get full
> hpa> compatibility with the existing software stack. The equation
> hpa> really seems rather simple.
>
> 4KB sectors are not a win for anybody except the drive vendors.
>

Obviously. However, larger physical storage unit sizes -- 4K for
spinning media, but frequently much larger for flash, for example -- are
already in wide use, and having a huge mishmash of logical block sizes
isn't going to work very well.

> There is a push in the industry right now to keep the 512-byte logical
> blocks forever. The first step would be to report misaligned accesses
> or accesses that are not a multiple of the physical block size. Second
> step would be to eventually reject any write that's not a properly
> aligned multiple of the physical block size.

I personally suspect that that is the way it is going to go, rather than
trying to change the software ecosystem to a different logical block
size. It has been tried in the past and failed, with the sole exception
of CD-ROMs, pretty much.

-hpa

Tejun Heo

Mar 8, 2010, 9:30:01 PM
Hello,

On 03/09/2010 05:12 AM, H. Peter Anvin wrote:
> Please correct the following bit in C-3:
>
> "A different partition format - GPT[6] - should be used beyond 2^32
> sectors, which could harm compatibility with older BIOSs or other
> operating systems which don't recognize the new format."
>
> BIOS does not care about the partition table format. There might be
> issues with > 2^32 sectors for BIOSes (e.g. truncating sector counts),
> but that would be unrelated.

Updated to,

  This might also be beneficial for operating systems which don't
  suffer from this limitation. A different partition format - GPT[6]
  - should be used beyond 2^32 sectors, which could harm compatibility
  with other operating systems which don't recognize the new format.

Thanks.

--
tejun

Tejun Heo

Mar 8, 2010, 9:40:01 PM
Hello,

On 03/09/2010 04:58 AM, Karel Zak wrote:
>> Tejun> Reportedly, commonly used partitioners aren't ready to handle
>> Tejun> drives larger than 2 TiB in any configuration and alignment isn't
>
> The limit is specific for DOS partition table (with 512-byte log.
> sectors), but for example GPT uses 64-bit LBA. I believe that our
> partitioning tools don't introduce any other restriction.

Hmmm... the 'reportedly' was from Daniel Taylor or maybe I just
misinterpreted the conversation. Daniel, can you please fill in?

>> Tejun> done properly for drives with 4 KiB physical sectors. 4 KiB
>> Tejun> logical sector support is broken in both the kernel
>>
>> Huh, what? My homedir is on a 4KiB LBS/PBS drive and has been for ~2
>> years.

By default, they aren't aligned properly, are they?

>> Tejun> (need more details and probably a whole section on partitioner
>> Tejun> behaviors)
>>
>> I'm Cc:'ing Karel Zak and Jim Meyering who have been doing all the
>> alignment work for fdisk and parted respectively. Karel, Jim: The full
>> writeup is here:
>>
>> http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
>>
>> It'd be great if you guys could share what you have been doing to the
>> tooling.
>
> small summary:
>
> - libblkid provides unified API to topology information, it supports:
> - ioctls (kernel >= 2.6.32)
> - sysfs (kernel >= 2.6.31)
> - stripe chunk size and stripe width for DM, MD. LVM and evms on
> old kernels
> - libparted and fdisk are linked against libblkid
>
> - fdisk supports 4KiB logical sector size (util-linux-ng >= 2.15
> - fdisk supports 4KiB physical sector size (util-linux-ng >= 2.17)
> - fdisk uses 1MiB alignment (or more if optimal I/O size is bigger)
> and alignment_offset for all partitions in non-DOS mode
> (util-linux-ng >= 2.17.1)

That's great. Daniel, maybe you were testing older versions? Or
maybe those failures came from libata mishandling 4KiB r/w
requests.

> - parted supports 4KiB physical sector size
> - parted uses 1MiB alignment for disks with unknown topology, disks
> with topology information are aligned to optimal (or minimum) I/O
> size (parted >= 2.1)

This will result in incorrect alignment for drives which lie about the
physical sector size to work around BIOS/drivers issues (C-1). It
would probably be best to align to at least 1MiB.

> - EFI GPT code in the kernel has been updated to works properly with
> 4KiB sectors (kernel >= 2.6.33)

libata is broken for logical 4KiB ATA devices tho. I'll fix it up.

> - mkfs.{ext,xfs,gfs2,ocfs2} have been update to work properly with
> topology information, mkfs.{ext,xfs} are linked against libblkid
> for compatibility with old kernel (for stripe chunk size / width)
>
> - Fedora-13/RHEL6 installer uses libparted with 4KiB support
>
> - alignment_offset & 4KiB support is planned for LUKS (cryptsetup)
>
>> Tejun> Unfortunately, the transition to 4 KiB sector size, physical only
>> Tejun> or logical too, is looking fairly ugly. Hopefully, a reasonable
>> Tejun> solution can be reached in not too distant future but even with
>> Tejun> all the software side updated, it looks like it's gonna cause
>> Tejun> significant amount of confusion and frustration.
>>
>> With regards to XP compatibility I don't think we should go too much out
>> of our way to accommodate it. XP has been disowned by its master and I
>> think virtualization will take care of the rest.

Yeah, good point. I'm just a bit worried that it might generate a lot
of frustrated bug reports. Well, maybe we should just advise users to
install Windows first and then install Linux.

>> FWIW, recent fdisk has a command line flag that will enable/disable DOS
>> compatible layout.
>
> yes, util-linux-ng 2.17.1, fdisk -c
>
> Note that non-DOS mode will be default in the next major
> util-linux-ng release.

I'll try to merge this information into the ata-4k doc.

Thank you very much.

--
tejun

Jeff Garzik

Mar 8, 2010, 9:50:02 PM
On 03/08/2010 09:34 PM, Tejun Heo wrote:
> libata is broken for logical 4KiB ATA devices tho. I'll fix it up.

Does libata-dev.git#sectsize miss any details?

Jeff

Tejun Heo

Mar 8, 2010, 9:50:02 PM
Hello, again.

On 03/09/2010 11:34 AM, Tejun Heo wrote:
>> - parted uses 1MiB alignment for disks with unknown topology, disks
>> with topology information are aligned to optimal (or minimum) I/O
>> size (parted >= 2.1)
>
> This will result in incorrect alignment for drives which lie about the
> physical sector size to work around BIOS/drivers issues (C-1). It
> would probably be best to align to at least 1MiB.

I misread it. C-1 would be disks w/o alignment information, which will
be aligned to optimal_io_size, which again would be 0 and thus fall back
to 1MiB alignment. So, this should work, right?

Thanks.

Tejun Heo

Mar 8, 2010, 9:50:01 PM
Hello,

On 03/09/2010 12:18 AM, Martin K. Petersen wrote:
>>>>>> "Tejun" == Tejun Heo <t...@kernel.org> writes:
> Tejun> Please note that hdparm is misreporting the alignment offset. It
> Tejun> should be reporting 512 instead of 256 for offset-by-one drives.
>
> Already fixed. Your hdparm must be old.

Yeah, I know Mark fixed it but couldn't find where the tree was. SF
only had old releases, so...

(other stuff replied further down the thread)

Thanks.

--
tejun

Tejun Heo

Mar 8, 2010, 10:00:02 PM
Hello,

On 03/09/2010 04:34 AM, Mike Snitzer wrote:
> I've been keeping track of all the pieces in play, have coordinated
> with kzak and jim, and have a summary that offers some amount of macro
> detail (at the end I touch on parted and fdisk):
>
> http://people.redhat.com/msnitzer/docs/io-limits.txt

Ah... this is great. I'll link the doc and shamelessly steal parts of
it if that's okay with you.

Thanks.

--
tejun

Tejun Heo

Mar 8, 2010, 10:00:02 PM
Hello,

On 03/09/2010 11:42 AM, Jeff Garzik wrote:
> On 03/08/2010 09:34 PM, Tejun Heo wrote:
>> libata is broken for logical 4KiB ATA devices tho. I'll fix it up.
>
> Does libata-dev.git#sectsize miss any details?

I haven't looked at it yet. I'll review it soon but the thing is
without actual hardware it would be a bit difficult to tell. It's not
only the drivers. I have this mighty unhappy feeling that some
controllers (especially some of the SATA ones with internal state
machine to emulate SFF) would be sniffing the commands and making the
wrong assumption if 4KiB logical sector size is used, so we'll need to
test various controllers. Some PATA-SATA bridge chips will definitely
be having problems too. Then there are the USB and other bridges too
but well those aren't libata's problem at least. :-)

Thanks.

--
tejun

Martin K. Petersen

Mar 8, 2010, 10:20:02 PM
>>>>> "Tejun" == Tejun Heo <hte...@gmail.com> writes:

>> This will result in incorrect alignment for drives which lie about
>> the physical sector size to work around BIOS/drivers issues (C-1).
>> It would probably be best to align to at least 1MiB.

Tejun> I misread it. C-1 would be disks w/o alignment information which
Tejun> will be aligned to optimal_io_size which again would be 0 and
Tejun> thus 1MiB alignment. So, this should work, right?

Correct. ATA only provides physical block size whereas SCSI has the
extra knobs in the block limits VPD. And consequently ATA block devices
have min_io = physical block size and optimal_io = 0.

So we'll align to 1 MB by default.
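
As a rough illustration of how those values turn into a partition start, here is a Python sketch of one plausible rule: take the 1 MiB default and round it up to a multiple of the reported I/O size. This is a simplification written for this thread, not the actual fdisk or parted code, and the function names are hypothetical.

```python
MIB = 1 << 20

def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple

def alignment_grain(minimum_io, optimal_io, default=MIB):
    """Alignment grain in bytes: the 1 MiB default, raised to a multiple
    of optimal_io (or minimum_io when optimal_io is 0)."""
    io_unit = optimal_io or minimum_io or default
    return round_up(default, io_unit)

def first_aligned_lba(grain, alignment_offset=0, logical=512):
    """First partition start, in logical sectors, honouring the
    alignment_offset (in bytes) reported by the kernel."""
    return (grain + alignment_offset) // logical

# ATA drive with 4 KiB physical sectors: min_io = 4096, optimal_io = 0
grain = alignment_grain(4096, 0)
print(grain, first_aligned_lba(grain))
# Same drive reporting a 3584-byte alignment offset (offset-by-one)
print(first_aligned_lba(grain, alignment_offset=3584))
# Hypothetical RAID volume: 64 KiB chunk, 1.5 MiB full stripe
print(alignment_grain(65536, 3 * 512 * 1024))
```

For the plain ATA case this gives a 1 MiB grain and a first partition at LBA 2048; with a reported 3584-byte offset the start moves to LBA 2055, and the RAID example ends up with a 1.5 MiB grain.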

--
Martin K. Petersen Oracle Linux Engineering

Martin K. Petersen

Mar 8, 2010, 10:20:02 PM
>>>>> "Tejun" == Tejun Heo <t...@kernel.org> writes:

>>> Huh, what? My homedir is on a 4KiB LBS/PBS drive and has been for
>>> ~2 years.

Tejun> By default, they aren't aligned properly, are they?

Single partition. I did the alignment manually.


Tejun> libata is broken for logical 4KiB ATA devices tho. I'll fix it
Tejun> up.

Matthew implemented support for this a while back...


Tejun> I'm just a bit worried that it might generate a lot of frustrated
Tejun> bug reports. Well, maybe we should just advise users to install
Tejun> windows first and then install Linux.

Unfortunately there is no simple solution given that we can't go back in
time and fix legacy DOS/XP behavior.

The 1-alignment jumper (that some drives have) fixes things for the
first partition but will mess up our alignment for subsequent ones
unless the firmware actually reports the shift. So no matter what we do
the user will have to have a bare minimum of knowledge about 512-byte
LBS/4 KB PBS drives. That sucks. But even Windows users are presented
with extra documentation and alignment utilities during the transition.

Having a 1 MB alignment by default and hoping that devices that lie will
be 0-aligned is the best we can do, I think.
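
A quick sketch of why the jumper cuts both ways. It assumes the jumper makes the drive store logical LBA n in physical 512-byte slot n + 1 (the "offset-by-one" behaviour) and that there are eight logical sectors per 4 KiB physical sector; `is_aligned` and `shift` are names made up for this example.

```python
SECTORS_PER_PHYSICAL = 4096 // 512  # 8 logical sectors per physical sector

def is_aligned(lba, shift=0):
    """True if logical `lba` falls on a physical sector boundary.
    `shift` models the jumper: LBA n is stored in slot n + shift."""
    return (lba + shift) % SECTORS_PER_PHYSICAL == 0

for lba in (63, 2048):
    print(lba, is_aligned(lba, shift=0), is_aligned(lba, shift=1))
```

LBA 63 (the legacy first-partition start) is misaligned without the jumper and aligned with it, while LBA 2048 (a 1 MiB start) is the other way around. That is the case where the partitioner can only do the right thing if the firmware reports the shift.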

--
Martin K. Petersen Oracle Linux Engineering

Martin K. Petersen

Mar 8, 2010, 10:30:02 PM
>>>>> "Tejun" == Tejun Heo <t...@kernel.org> writes:

Tejun> Yeah, I know Mark fixed it but couldn't find where the tree was.
Tejun> SF only had old releases, so...

Tejun> (other stuff replied further down the thread)

Looks like Mark hasn't made an hdparm release since I posted the patch.
It's here:

http://marc.info/?l=linux-ide&m=126427438620651&w=2

--
Martin K. Petersen Oracle Linux Engineering

Martin K. Petersen

Mar 8, 2010, 10:30:02 PM
>>>>> "Tejun" == Tejun Heo <t...@kernel.org> writes:

>> http://people.redhat.com/msnitzer/docs/io-limits.txt

Tejun> Ah... this is great. I'll link the doc and shamelessly steal
Tejun> parts of it if that's okay with you.

There's also this one:

http://oss.oracle.com/~mkp/docs/linux-advanced-storage.pdf

It is more aimed at storage vendors than end users, though.

--
Martin K. Petersen Oracle Linux Engineering

Daniel Taylor

Mar 8, 2010, 10:50:02 PM

-----Original Message-----
From: Tejun Heo [mailto:t...@kernel.org]
Sent: Monday, March 08, 2010 6:34 PM
To: Karel Zak
Cc: Martin K. Petersen; linu...@vger.kernel.org; lkml; Daniel Taylor; Jeff
Garzik; Mark Lord; ty...@mit.edu; H. Peter Anvin;
hiro...@mail.parknet.co.jp; Andrew Morton; Alan Cox; irt...@gmail.com;
Matthew Wilcox; asch...@suse.de; knik...@suse.de; jdel...@suse.de; Jim
Meyering
Subject: Re: ATA 4 KiB sector issues.

Hello,

On 03/09/2010 04:58 AM, Karel Zak wrote:
>> Tejun> Reportedly, commonly used partitioners aren't ready to handle
>> Tejun> drives larger than 2 TiB in any configuration and alignment
>> Tejun> isn't
>
> The limit is specific for DOS partition table (with 512-byte log.
> sectors), but for example GPT uses 64-bit LBA. I believe that our
> partitioning tools don't introduce any other restriction.

Hmmm... the 'reportedly' was from Daniel Taylor or maybe I just
misinterpreted the conversation. Daniel, can you please fill in?

DLT> The problem that I see is that the installers and upper level
DLT> applications do not make good choices for partition layout.
DLT> "parted", itself, seems to work OK in the latest version. One of the
DLT> things I've heard since I started this process is that there are some
DLT> libraries associated with the process of partitioning/formatting.
DLT> Perhaps the upper layers and those libraries aren't synced up?

DLT> As I said, above, it could be libraries. I was not aware that so much
DLT> of the implementation was embedded there.

> - parted supports 4KiB physical sector size
> - parted uses 1MiB alignment for disks with unknown topology, disks
> with topology information are aligned to optimal (or minimum) I/O
> size (parted >= 2.1)

This will result in incorrect alignment for drives which lie about the
physical sector size to work around BIOS/drivers issues (C-1). It would
probably be best to align to at least 1MiB.

DLT> Please.

> - EFI GPT code in the kernel has been updated to works properly with
> 4KiB sectors (kernel >= 2.6.33)

libata is broken for logical 4KiB ATA devices tho. I'll fix it up.

> - mkfs.{ext,xfs,gfs2,ocfs2} have been update to work properly with
> topology information, mkfs.{ext,xfs} are linked against libblkid
> for compatibility with old kernel (for stripe chunk size / width)
>
> - Fedora-13/RHEL6 installer uses libparted with 4KiB support
>
> - alignment_offset & 4KiB support is planned for LUKS (cryptsetup)
>
>> Tejun> Unfortunately, the transition to 4 KiB sector size, physical
>> Tejun> only or logical too, is looking fairly ugly. Hopefully, a
>> Tejun> reasonable solution can be reached in not too distant future
>> Tejun> but even with all the software side updated, it looks like
>> Tejun> it's gonna cause significant amount of confusion and frustration.
>>
>> With regards to XP compatibility I don't think we should go too much
>> out of our way to accommodate it. XP has been disowned by its master
>> and I think virtualization will take care of the rest.

Yeah, good point. I'm just a bit worried that it might generate a lot of
frustrated bug reports. Well, maybe we should just advise users to install
windows first and then install Linux.

DLT> Simple reality is that XP is "forever". Drives >2TiB, which may be
DLT> USB-attached, used with XP will be MBR-partitioned and use 4096-byte
DLT> sectors. We need to be able to read/write those disks on Linux systems.

>> FWIW, recent fdisk has a command line flag that will enable/disable
>> DOS compatible layout.
>
> yes, util-linux-ng 2.17.1, fdisk -c
>
> Note that non-DOS mode will be default in the next major
> util-linux-ng release.

I'll try to merge these information into the ata-4k doc.

Thank you very much.

DLT> One last comment: I just tried to partition and format a >2TiB drive
DLT> on fully updated Ubuntu 9.10 with GParted. I selected not to cylinder
DLT> align, use GPT and ext3, and to put 1 MiB preceding and following.
DLT> libparted failed with "unable to satisfy all constraints of the
DLT> partition". Using "parted", I created the partition, and then GParted
DLT> was able to apply the ext3 file system.

Martin K. Petersen

Mar 9, 2010, 12:00:02 AM
>>>>> "DLT" == Daniel Taylor <Daniel...@wdc.com> writes:

DLT> Simple reality is that XP is "forever". Drives >2TiB, which may be
DLT> USB-attached, used with XP will be MBR-partitioned and use
DLT> 4096-byte sectors. We need to be able to read/write those disks on
DLT> Linux systems.

Shouldn't be a problem as long as the DOS partition table vs. 4 KiB
sectors thing is fixed.


DLT> One last comment: I just tried to partition and format a >2TiB
DLT> drive on fully updated Ubuntu 9.10 with GParted. I selected not to
DLT> cylinder align, use GPT and ext3, and to put 1 MiB preceding and
DLT> following. libparted failed with "unable to satisfy all
DLT> constraints of the partition". Using "parted", I created the
DLT> partition, and then GParted was able to apply the ext3 file system.

I don't think Ubuntu has adopted any of the relevant updates yet.

I believe the Fedora 13 Alpha is due to be released this week. That
would be the best test platform because several of the people who have
been actively engaged in the 4 KiB sector enablement process are Fedora
developers.

--
Martin K. Petersen Oracle Linux Engineering

Mikael Abrahamsson

Mar 9, 2010, 1:40:01 AM
On Mon, 8 Mar 2010, Tejun Heo wrote:

> http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues

Excellent summary.

> C-2. Windows XP depends on the traditional partition layout.

Is this really true? WD ships their EARS drives with an alignment tool
that, as far as I can understand, moves the partition so it's aligned to
4KiB:

http://www.wdc.com/en/products/advancedformat/

So an XP fresh install (including letting XP partition the drive) will be
misaligned, but if you clone XP onto a properly aligned partition (or run
the tool and let it move the partition), it'll be ok. So saying that XP
"depends" on traditional partition layout might be a bit of a stretch?

--
Mikael Abrahamsson email: swm...@swm.pp.se

Michael Tokarev

Mar 9, 2010, 2:00:01 AM
Mike Snitzer wrote:
[]

> I've been keeping track of all the pieces in play, have coordinated
> with kzak and jim, and have a summary that offers some amount of macro
> detail (at the end I touch on parted and fdisk):
>
> http://people.redhat.com/msnitzer/docs/io-limits.txt

What I don't see in this thread and in this document is any mention
of the Linux md layer. I think it is the first candidate to test the whole
thing, the easiest and most important one. I mean the alignment and
"recommended I/O size" and all this similar stuff.

Think of a raid5 array - with all the mentioned good stuff in place
fdisk should figure out how to align partitions on the array stripe
boundary, and should do that automatically. And this should be the
easiest to debug/test, since the whole thing is controllable by the
kernel.

But apparently it does not implement anything of this sort.
Adding Neilb to the Cc list.......

Thanks!

/mjt

Jim Meyering

Mar 9, 2010, 2:30:02 AM
Karel Zak wrote:
> On Mon, Mar 08, 2010 at 10:18:27AM -0500, Martin K. Petersen wrote:
...

Thanks for the summary, Karel.
In case anyone wants more high-level detail on the parted front,
here's its NEWS file:

http://git.debian.org/?p=parted/parted.git;a=blob;f=NEWS

Currently, I'm not planning much for Parted, other than clean-up.
For example, I want to remove all of the FS-related code (it's
horribly bit-rotted) from the package, with the exception of
HFS/HFS+ and FAT resizing capabilities, since AFAIK, Parted
has the only free implementations. If any of you know of other
implementations or work in progress, please let me know.


Related information, prompted by my recent encounter with a
tool that refused to let me use a GPT partition table.

Partition table formats: prefer GUID/GPT:

Having spent more than my share of time looking at partition table
formats recently, I am now strongly biased against DOS partition
tables, and for GUID/GPT ones. In addition to allowing for >2TiB
partition offsets and lengths, GPT tables provide for better
protection in case of corruption (checksums, backup table at end
of disk) and don't have the anachronistic distinction of primary
and extended/logical partitions (all partitions are "primary").
You can even give each partition a name. The only reason to use a
DOS partition table on a new installation is if you're stuck with
a requirement of using an OS like XP on bare metal.

Please consider encouraging the use of GPT partition tables...
or at least do not *dis*courage their use.

Karel Zak

unread,
Mar 9, 2010, 5:10:01 AM3/9/10
to
On Tue, Mar 09, 2010 at 09:53:37AM +0300, Michael Tokarev wrote:
> Mike Snitzer wrote:
> []
> > I've been keeping track of all the pieces in play, have coordinated
> > with kzak and jim, and have a summary that offers some amount of macro
> > detail (at the end I touch on parted and fdisk):
> >
> > http://people.redhat.com/msnitzer/docs/io-limits.txt
>
> What I don't see in this thread and in this document is - any mention
> of linux md layer. I think it is the first candidate to test the whole
> thing, the easiest and most important one. I mean the alignment and
> "recommended I/O size" and all this similar stuff.
>
> Think of a raid5 array - with all the mentioned good stuff in place
> fdisk should figure out to align partitions on the array stripe
> boundary, and should do that automatically. And this should be

Yes. For userspace there is no difference between RAID and non-RAID
devices -- the topology support in kernel provides a unified API to all
devices. It means we don't need any extra support for RAIDs in
fdisk/parted. The userspace tools follow topology data from kernel.

The good thing with 1MiB default alignment is that it is usable for
usual stripe sizes (for sizes greater than 1MiB we use optimal I/O
size).

> most easy to debug/test, since the whole thing is controllable
> by kernel.

I did almost all my tests with scsi_debug or MD RAID0 on scsi_debug.
It works as expected. (Note that kernel 2.6.31 has a problem with
alignment_offset calculation on stacked devices, so use the latest
kernel where the bug is already fixed.)

But I haven't tried using unpartitioned (whole) 4K disks for RAIDs,
because scsi_debug does not allow creating more devices (and I don't
have real HW).

Some tests are available in util-linux-ng sources:
http://git.kernel.org/?p=utils/util-linux-ng/util-linux-ng.git;a=tree;f=tests/ts/fdisk

Karel


# modprobe scsi_debug dev_size_mb=2500 sector_size=512 physblk_exp=3

[..create partitions...]

# fdisk -lcu /dev/sdb

Disk /dev/sdb: 2621 MB, 2621440000 bytes
255 heads, 63 sectors/track, 318 cylinders, total 5120000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 32768 bytes
Disk identifier: 0xb585b0be

Device Boot Start End Blocks Id System
/dev/sdb1 2048 1026047 512000 83 Linux
/dev/sdb2 1026048 2050047 512000 83 Linux
/dev/sdb3 2050048 3074047 512000 83 Linux
/dev/sdb4 3074048 4098047 512000 83 Linux


# mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}

[...create partitions on the raid...]

# fdisk -lcu /dev/md8

Disk /dev/md8: 1572 MB, 1572667392 bytes
2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 65536 bytes / 65536 bytes
Disk identifier: 0x1bb6fd8d

Device Boot Start End Blocks Id System
/dev/md8p1 2048 1435647 716800 83 Linux
/dev/md8p2 1435648 2869247 716800 83 Linux


Check offsets (alignment):

# cat /sys/block/sdb/sdb{1,2,3,4}/alignment_offset
0
0
0
0

# cat /sys/block/md8/md8p{1,2}/alignment_offset
0
0
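
The same check can also be done by hand from the start sectors in the fdisk output. A minimal sketch, assuming 512-byte logical and 4096-byte physical sectors and a disk whose own alignment offset is 0, as in the scsi_debug setup above; the helper name is made up for illustration:

```python
def is_aligned(start_lba, logical=512, physical=4096):
    """Return True if a partition starting at start_lba (counted in
    logical sectors) begins on a physical sector boundary.

    Assumes the disk itself reports alignment_offset 0.
    """
    return (start_lba * logical) % physical == 0


# Start sectors taken from the fdisk listings above.
for start in (2048, 1026048, 2050048, 3074048, 1435648):
    print(start, is_aligned(start))

# The traditional DOS start sector, for comparison.
print(63, is_aligned(63))
```

With these defaults every start sector that is a multiple of 8 passes, which is why the 2048-sector (1MiB) default start is safe on 4K drives.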

--
Karel Zak <kz...@redhat.com>

Michal Soltys

unread,
Mar 9, 2010, 5:20:02 AM3/9/10
to
Mikael Abrahamsson wrote:
> On Mon, 8 Mar 2010, Tejun Heo wrote:
>
>> http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
>
> Excellent summary.
>
>> C-2. Windows XP depends on the traditional partition layout.
>
> Is this really true? WD ships their EARS drives with an alignment tool
> that as far as I can understand, moves the partition so
> it's aligned to 4KiB:
>

XP SP2 (or later) can boot from any place, including logical partitions
(tested that recently). Most important thing is "hidden sectors" (recent
chain.c32 can set that automatically through ntldr and/or sethidden
options). No idea about pre-SP2; Win 2000 will not boot from a "misaligned"
(with reference to cylinder boundary) partition.

Michael Tokarev

unread,
Mar 9, 2010, 5:20:02 AM3/9/10
to
Karel Zak wrote:
> On Tue, Mar 09, 2010 at 09:53:37AM +0300, Michael Tokarev wrote:
[]

>> Think of a raid5 array - with all the mentioned good stuff in place
>> fdisk should figure out to align partitions on the array stripe
>> boundary, and should do that automatically. And this should be
>
> Yes. For userspace there is not a difference between RAID and non-RAID
> device -- the topology support in kernel provides unified API to all
> devices. It means we needn't any extra support for RAIDs in
> fdisk/parted. The userspace tools follow topology data from kernel.
>
> The good thing with 1MiB default alignment is that it is usable for
> usual stripe sizes (for sizes greater than 1MiB we use optimal I/O
> size).

No, it's not that simple. For raid5 (and I especially mentioned raid5
above), raid4 and raid6, 1MiB is only good when the number of devices
is 2^N+1 (for raid[45]) or 2^N+2 (for raid6). For raid5 that means
3, 5, 9, 17, .. disks. In all other cases the alignment (which should
match stripe size) will not be power of two. For example, for a 4-disk
raid5 array with 1MiB chunk size the partitions should be aligned at
3MiB boundaries. For 6-disk raid5 with 256KiB chunk size it is
5x256=1280 KiB. And so on.

Yes it has little to do with the $subject (4KiB sectors), but it is
closely related still.
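
The arithmetic above can be sketched as follows. This is only an illustration of the data-disks-times-chunk rule described in this message, not anything md or fdisk actually runs; the function names are made up:

```python
# Number of parity devices per RAID level.
PARITY_DISKS = {"raid0": 0, "raid4": 1, "raid5": 1, "raid6": 2}


def full_stripe_kib(level, ndisks, chunk_kib):
    """Full stripe size in KiB: data disks times chunk size."""
    data_disks = ndisks - PARITY_DISKS[level]
    if data_disks < 1:
        raise ValueError("not enough devices for " + level)
    return data_disks * chunk_kib


def is_power_of_two(n):
    return n > 0 and n & (n - 1) == 0


# 4-disk raid5, 1MiB chunk: 3 data disks, 3MiB stripe.
print(full_stripe_kib("raid5", 4, 1024))
# 6-disk raid5, 256KiB chunk: 5 data disks, 1280KiB stripe.
print(full_stripe_kib("raid5", 6, 256))
# Only 2^N data disks give a power-of-two stripe.
print(is_power_of_two(full_stripe_kib("raid5", 5, 256)))
```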

>> most easy to debug/test, since the whole thing is controllable
>> by kernel.
>
> I did almost all my tests with scsi_debug or MD RAID0 on scsi_debug.
> It works as expected.

Actually, for raid0, the alignment is questionable. Should it be a
multiple of chunk size or whole stripe size? I'm not sure, both ways
have good and bad sides. But if it is the latter, the same issue
pops up again: do a 3-disk raid0 and you'll have to align to 3*2^N.

[]


> Disk /dev/sdb: 2621 MB, 2621440000 bytes
> 255 heads, 63 sectors/track, 318 cylinders, total 5120000 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 4096 bytes
> I/O size (minimum/optimal): 4096 bytes / 32768 bytes

Good.

> # mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}

That's 3-disk stripe size with default 64KiB chunk size, which makes
3x64=192KiB - the number to which everything should be aligned.

> # fdisk -lcu /dev/md8
>
> Disk /dev/md8: 1572 MB, 1572667392 bytes
> 2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 4096 bytes
> I/O size (minimum/optimal): 65536 bytes / 65536 bytes

And here we go: fdisk does not see the right number: nothing
is divisible by 3.

[]


> # cat /sys/block/md8/md8p{1,2}/alignment_offset
> 0
> 0

And that's where the issue is. md does not {sup,re}port all
this stuff yet.

This is what I'm talking about.

Thanks!

/mjt

Dave Chinner

unread,
Mar 9, 2010, 6:20:02 AM3/9/10
to
On Tue, Mar 09, 2010 at 01:16:01PM +0300, Michael Tokarev wrote:
> Karel Zak wrote:
> > I did almost all my tests with scsi_debug or MD RAID0 on scsi_debug.
> > It works as expected.
>
> Actually, for raid0, the alignment is questionable. Should it be a
> multiple of chunk size or whole stripe size? I'm not sure, both ways
> has bad and good sides.. But if it is the latter, the same issues
> pops up again: do a 3-disk raid0 and you'll have to align to 3*2^N.

Yes, alignment is still needed, especially for filesystems that can
do stripe unit aligned allocation like XFS. If you don't align the
filesystem properly, all the data IO will be mis-aligned to the
underlying disks and stripe unit sized IO will hit multiple disks
rather than just one....
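
To make that concrete, here is a rough sketch of which member disks a single IO touches. It assumes a plain round-robin RAID0 layout (chunk k lives on disk k mod ndisks), which is a simplification; the function name is hypothetical:

```python
def disks_touched(offset, length, chunk, ndisks):
    """Set of member-disk indices hit by an IO of `length` bytes at
    byte `offset` on a round-robin striped array."""
    first = offset // chunk
    last = (offset + length - 1) // chunk
    return {c % ndisks for c in range(first, last + 1)}


CHUNK = 64 * 1024

# Stripe unit sized IO, aligned: one disk.
print(disks_touched(0, CHUNK, CHUNK, 3))
# Same IO shifted by 4KiB: spills onto the next disk.
print(disks_touched(4096, CHUNK, CHUNK, 3))
```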

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com

Michael Tokarev

unread,
Mar 9, 2010, 6:40:02 AM3/9/10
to
Dave Chinner wrote:
> On Tue, Mar 09, 2010 at 01:16:01PM +0300, Michael Tokarev wrote:
>> Karel Zak wrote:
>>> I did almost all my tests with scsi_debug or MD RAID0 on scsi_debug.
>>> It works as expected.
>> Actually, for raid0, the alignment is questionable. Should it be a
>> multiple of chunk size or whole stripe size? I'm not sure, both ways
>> has bad and good sides.. But if it is the latter, the same issues
>> pops up again: do a 3-disk raid0 and you'll have to align to 3*2^N.
>
> Yes, alignment is still needed, especially for filesystems that can
> do stripe unit aligned allocation like XFS. If you don't align the
> filesystem properly, all the data IO will be mis-aligned to the
> underlying disks and stripe unit sized IO will hit multiple disks
> rather than just one....

I understand alignment is needed, the question is if the alignment
should be to chunk size or full-stripe size. In neither case it
will be bad for underlying disks.

/mjt

Karel Zak

unread,
Mar 9, 2010, 7:00:02 AM3/9/10
to
On Tue, Mar 09, 2010 at 01:16:01PM +0300, Michael Tokarev wrote:
> Karel Zak wrote:
> > # mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}
>
> That's 3-disk stripe size with default 64KiB chunk size, which makes
> 3x64=192KiB - the number to which everything should be aligned.
>
> > # fdisk -lcu /dev/md8
> >
> > Disk /dev/md8: 1572 MB, 1572667392 bytes
> > 2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
> > Units = sectors of 1 * 512 = 512 bytes
> > Sector size (logical/physical): 512 bytes / 4096 bytes
> > I/O size (minimum/optimal): 65536 bytes / 65536 bytes
>
> And here we go: fdisk does not see the right number: nothing
> is divisible by 3.
>
> []
> > # cat /sys/block/md8/md8p{1,2}/alignment_offset
> > 0
> > 0
>
> And that's where the issue is. md does not {sup,re}port all
> this stuff yet.
>
> This is what I'm talking about.

Note that I have 2.6.31.12-174.2.22.fc12.x86_64 kernel on my laptop.
It would be better for serious tests to use 2.6.33.

Karel

--
Karel Zak <kz...@redhat.com>

Karel Zak

unread,
Mar 9, 2010, 7:30:02 AM3/9/10
to
On Tue, Mar 09, 2010 at 01:16:01PM +0300, Michael Tokarev wrote:
> Karel Zak wrote:
> > # mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}
>
> That's 3-disk stripe size with default 64KiB chunk size, which makes
> 3x64=192KiB - the number to which everything should be aligned.
>
> > # fdisk -lcu /dev/md8
> >
> > Disk /dev/md8: 1572 MB, 1572667392 bytes
> > 2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
> > Units = sectors of 1 * 512 = 512 bytes
> > Sector size (logical/physical): 512 bytes / 4096 bytes
> > I/O size (minimum/optimal): 65536 bytes / 65536 bytes
>
> And here we go: fdisk does not see the right number: nothing
> is divisible by 3.

Well, the same setup with 2.6.34-0.9.rc0.git13.fc14.x86_64:

# fdisk -luc /dev/sdb

Disk /dev/sdb: 2621 MB, 2621440000 bytes
255 heads, 63 sectors/track, 318 cylinders, total 5120000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 32768 bytes

Disk identifier: 0x77fbab55

Device Boot Start End Blocks Id System
/dev/sdb1 2048 1026047 512000 83 Linux
/dev/sdb2 1026048 2050047 512000 83 Linux
/dev/sdb3 2050048 3074047 512000 83 Linux
/dev/sdb4 3074048 4098047 512000 83 Linux

# mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}


# fdisk -luc /dev/md8

Disk /dev/md8: 1572 MB, 1572667392 bytes
2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 65536 bytes / 65536 bytes


# cat /sys/block/md8/queue/{minimum,optimal}_io_size
65536
65536

> > # cat /sys/block/md8/md8p{1,2}/alignment_offset
> > 0
> > 0
>
> And that's where the issue is. md does not {sup,re}port all
> this stuff yet.

Hmm...

Karel

--
Karel Zak <kz...@redhat.com>

Dave Chinner

unread,
Mar 9, 2010, 7:30:02 AM3/9/10
to
On Tue, Mar 09, 2010 at 02:38:57PM +0300, Michael Tokarev wrote:
> Dave Chinner wrote:
> > On Tue, Mar 09, 2010 at 01:16:01PM +0300, Michael Tokarev wrote:
> >> Karel Zak wrote:
> >>> I did almost all my tests with scsi_debug or MD RAID0 on scsi_debug.
> >>> It works as expected.
> >> Actually, for raid0, the alignment is questionable. Should it be a
> >> multiple of chunk size or whole stripe size? I'm not sure, both ways
> >> has bad and good sides.. But if it is the latter, the same issues
> >> pops up again: do a 3-disk raid0 and you'll have to align to 3*2^N.
> >
> > Yes, alignment is still needed, especially for filesystems that can
> > do stripe unit aligned allocation like XFS. If you don't align the
> > filesystem properly, all the data IO will be mis-aligned to the
> > underlying disks and stripe unit sized IO will hit multiple disks
> > rather than just one....
>
> I understand alignment is needed, the question is if the alignment
> should be to chunk size or full-stripe size. In neither case it
> will be bad for underlying disks.

Depends on the RAID implementation. High end RAID arrays often have
cache bypass features that are triggered by stripe width aligned and
sized IOs. When receiving well formed IO they can more than double
write performance because they are not limited by internal cache
mirroring bandwidth (e.g. the controller magically switches to
write-through for those well formed IOs instead of writeback).

So from that perspective, alignment needs to be to stripe width,
not stripe unit. Similarly for RAID5/6 alignment needs to be to
stripe width, so that a well formed IO issued by the filesystem
only hits one RAID5/6 stripe.
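
As a rough illustration of "only hits one stripe", the sketch below counts how many full stripes an IO spans. The 192KiB stripe width is taken from the 4-disk RAID5 with 64KiB chunks discussed earlier in the thread; the function name is made up:

```python
def stripes_spanned(offset, length, stripe_width):
    """Number of full-width stripes touched by an IO of `length`
    bytes starting at byte `offset`."""
    first = offset // stripe_width
    last = (offset + length - 1) // stripe_width
    return last - first + 1


WIDTH = 3 * 64 * 1024  # 3 data disks x 64KiB chunk = 192KiB

# Stripe width sized IO on a stripe boundary: one stripe.
print(stripes_spanned(0, WIDTH, WIDTH))
# Same IO starting at 1MiB, which is not a multiple of 192KiB: two stripes.
print(stripes_spanned(1024 * 1024, WIDTH, WIDTH))
```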

FWIW, XFS takes great care to ensure that it doesn't place all its
allocation group headers on the same stripe unit. Failing to
distribute the AG headers across all the stripe units evenly loads
the disks/luns in the stripe unevenly. As soon as you have uneven
load on a stripe the performance tanks as a stripe is only as fast as
its slowest member.

Also, while XFS prefers to align to stripe unit, there are mount
options to change the default allocation alignment to be stripe
width based. Hence if you have large files and applications that are
doing well formed IO, stripe width alignment of the filesystem to
the underlying block device is critical to achieving deterministic
throughput close to the maximum the hardware can support.....

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com

Mark Lord

unread,
Mar 9, 2010, 9:00:02 AM3/9/10
to
On 03/07/10 22:48, Tejun Heo wrote:
..

> Please note that hdparm is misreporting the alignment offset. It
> should be reporting 512 instead of 256 for offset-by-one drives.
..

That issue was fixed quite a while ago.
Upgrade your elderly copy of hdparm.

:)

Mark Lord

unread,
Mar 9, 2010, 9:40:01 AM3/9/10
to
On 03/08/10 22:18, Martin K. Petersen wrote:
>>>>>> "Tejun" == Tejun Heo<t...@kernel.org> writes:
>
> Tejun> Yeah, I know Mark fixed it but couldn't find where the tree was.
> Tejun> SF only had old releases, so...
>
> Tejun> (other stuff replied further down the thread)
>
> Looks like Mark hasn't made an hdparm release since I posted the patch.
..

Holy crap. I thought I'd put that out months ago!

Anyway, it's there now: https://sourceforge.net/projects/hdparm/

Thanks!

Daniel Taylor

unread,
Mar 9, 2010, 5:40:02 PM3/9/10
to

hpa> I would very much like a reference for a platform which has
hpa> firmware which can successfully boot from 4K-logical media. It
hpa> would be very useful for bootloader testing.


I am told that the Mac UEFI platform will boot from 4K logical/physical
drives.

Now I have to scrounge one of the old drives to test it.

Greg Freemyer

unread,
Mar 9, 2010, 5:50:02 PM3/9/10
to
<snip>
>
> As far as partitioning... I believe we should be using GPT partition tables
> where possible. Even on non-EFI systems, it's simply a much better
> partition table format.
>
> -hpa

GPT can not be used for boot disks in non-EFI systems, right?

Greg

Arnd Bergmann

unread,
Mar 9, 2010, 6:50:02 PM3/9/10
to
On Monday 08 March 2010 04:48:35 Tejun Heo wrote:
> Unfortunately, while Windows can assume that newer releases won't
> share the hard drive with older releases including Windows XP, Linux
> distros can't do that. There will be many installations where a
> modern Linux distros share a hard drive with older releases of
> Windows. At this point, I can't see a silver bullet solution.
>
> Partitioners maybe should only align partitions which will be used by
> Linux and default to the traditional layout for others while allowing
> explicit override. I think Windows XP wouldn't have problem with
> differently aligned partitions as long as it doesn't actually use them
> but haven't tested it.

Any idea if XP can cope with partition tables that use a 32-sector, 128-head
geometry rather than the default 63-sector, 255-head one? That seems to
be what some flash memory cards are using and it would make any cylinder
aligned partition also 4096-byte aligned, at the cost of moving the
1024-cylinder boundary from 7.88 GiB to 2 GiB.

Do we know of anything that requires 63s/255h?
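
For reference, the numbers behind that trade-off can be worked out as below. This is just arithmetic on 512-byte sectors; the helper names are made up for illustration:

```python
SECTOR = 512  # bytes per logical sector


def cylinder_bytes(heads, sectors_per_track):
    """Size of one cylinder in bytes."""
    return heads * sectors_per_track * SECTOR


def chs_limit_bytes(heads, sectors_per_track, cylinders=1024):
    """Bytes reachable below the 1024-cylinder boundary."""
    return cylinders * cylinder_bytes(heads, sectors_per_track)


# 128 heads x 32 sectors: 2MiB cylinders, so every cylinder
# boundary is also a 4096-byte boundary.
print(cylinder_bytes(128, 32), cylinder_bytes(128, 32) % 4096)
# 255 heads x 63 sectors: cylinders are not a multiple of 4096 bytes.
print(cylinder_bytes(255, 63), cylinder_bytes(255, 63) % 4096)

# 1024-cylinder boundary: exactly 2GiB vs. 8422686720 bytes (roughly 7.8GiB).
print(chs_limit_bytes(128, 32), chs_limit_bytes(255, 63))
```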

Arnd

Tejun Heo

unread,
Mar 9, 2010, 7:00:01 PM3/9/10
to
Hello,

On 03/09/2010 04:27 PM, Jim Meyering wrote:
> Related information, prompted by my recent encounter with a
> tool that refused to let me use a GPT partition table.
>
> Partition table formats: prefer GUID/GPT:
>
> Having spent more than my share of time looking at partition table
> formats recently, I am now strongly biased against DOS partition
> tables, and for GUID/GPT ones. In addition to allowing for >2TiB
> partition offsets and lengths, GPT tables provide for better
> protection in case of corruption (checksums, backup table at end
> of disk) and don't have the anachronistic distinction of primary
> and extended/logical partitions (all partitions are "primary").
> You can even give each partition a name. The only reason to use a
> DOS partition table on a new installation is if you're stuck with
> a requirement of using an OS like XP on bare metal.
>
> Please consider encouraging the use of GPT partition tables...
> or at least do not *dis*courage their use.

I'll surely include it.

Thanks.

--
tejun

Tejun Heo

unread,
Mar 9, 2010, 7:10:02 PM3/9/10
to
On 03/09/2010 10:55 PM, Mark Lord wrote:
> On 03/07/10 22:48, Tejun Heo wrote:
> ..
>> Please note that hdparm is misreporting the alignment offset. It
>> should be reporting 512 instead of 256 for offset-by-one drives.
> ..
>
> That issue was fixed quite a while ago.
> Upgrade your elderly copy of hdparm.

Heh heh, *you* were keeping it from me! Anyways, is there hdparm
devel tree published somewhere? I wandered the SF page for quite a
bit (which for some reason is very difficult to find things in) but I
couldn't find one. If it's not, it might be a good idea to put it on
SF or git.kernel.org?

Thanks.

--
tejun

Tejun Heo

unread,
Mar 9, 2010, 7:10:02 PM3/9/10
to
Hello,

On 03/10/2010 07:46 AM, Greg Freemyer wrote:
>> As far as partitioning... I believe we should be using GPT partition tables
>> where possible. Even on non-EFI systems, it's simply a much better
>> partition table format.
>

> GPT can not be used for boot disks in non-EFI systems, right?

IIUC, I think any BIOS should be able to do so as it only cares about
the code part of MBR not the partitions and even with GPT the MBR
remains the same with the partition part describing the rest of the
while disk as a single chunk containing GPT managed area. The only
problem is the older operating systems (like XP) which don't
understand GPT wouldn't be able to access those partitions.

Thanks.

--
tejun

Tejun Heo

unread,
Mar 9, 2010, 7:20:02 PM3/9/10
to
Hello,

On 03/09/2010 07:06 PM, Michal Soltys wrote:
> Mikael Abrahamsson wrote:
>> On Mon, 8 Mar 2010, Tejun Heo wrote:
>>
>>> http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
>>
>> Excellent summary.
>>
>>> C-2. Windows XP depends on the traditional partition layout.
>>
>> Is this really true? WD ships their EARS drives with an alignment tool
>> that as far as I can understand, moves the partition so
>> it's aligned to 4KiB:

Hmmm... I based that claim on the MS KB page and, as you pointed out,
the problem there could probably be a specific BIOS implementation
interacting badly. I'll update the doc.

> XP SP2 (or later) can boot from any place, including logical partitions
> (tested that recently). Most important thing is "hidden sectors" (recent
> chain.c32 can set that automatically through ntldr and/or sethidden
> options). No idea about pre-SP2 ; Win 2000 will not boot from
> "misaligned" (with reference to cylinder boundary) partition.

I was thinking about testing XP booting this weekend but really want
to avoid it, so thanks a lot for the info. I'll update the doc
accordingly but can you please enlighten me on how it works and what's
broken in detail? So, XP should be fine with any alignment?

Thanks.

--
tejun

Tejun Heo

unread,
Mar 9, 2010, 7:30:02 PM3/9/10
to
On 03/10/2010 08:46 AM, Arnd Bergmann wrote:
> On Monday 08 March 2010 04:48:35 Tejun Heo wrote:
>> Unfortunately, while Windows can assume that newer releases won't
>> share the hard drive with older releases including Windows XP, Linux
>> distros can't do that. There will be many installations where a
>> modern Linux distros share a hard drive with older releases of
>> Windows. At this point, I can't see a silver bullet solution.
>>
>> Partitioners maybe should only align partitions which will be used by
>> Linux and default to the traditional layout for others while allowing
>> explicit override. I think Windows XP wouldn't have problem with
>> differently aligned partitions as long as it doesn't actually use them
>> but haven't tested it.
>
> Any idea if XP can cope with partition tables that use a 32-sector, 128-head
> geometry rather than the default 63-sector, 255-head one? That seems to
> be what some flash memory cards are using and it would make any cylinder
> aligned partition also 4096-byte aligned, at the cost of moving the
> 1024-cylinder boundary from 7.88 GiB to 2 GiB.
>
> Do we know of anything that requires 63s/255h?

Michal Soltys pointed out that XP doesn't really depend on the legacy
layout although 2000 does (can't boot), so I guess it shouldn't be
much of a problem.

Regarding the geometry, IIUC changing it isn't meaningful for
compatibility. Geometry information is obtained using a BIOS call
(the int Xh thing) and the hard disk itself doesn't carry that
information, so unless you go into the BIOS set up and enter those
values manually (and I don't think you can do that on many BIOSs these
days), there's no way for anyone else to know custom geometry other
than solving equations using the CHS and LBA information in the
partition table.

So, feeding custom geometry to a partitioner which uses CHS to
determine the layout is useful to make it create partitions aligned in
certain way but as the information regarding the geometry is not
recorded anywhere, others will just keep using whatever they were
using (255*63) and figure that CHS and LBA in the partition tables
just don't match.

Thanks.

--
tejun

Daniel Taylor

unread,
Mar 9, 2010, 7:30:02 PM3/9/10
to

>> GPT can not be used for boot disks in non-EFI systems, right?

> IIUC, I think any BIOS should be able to do so as it only cares about
> the code part of MBR not the partitions and even with GPT the MBR
> remains the same with the partition part describing the rest of the
> while disk as a single chunk containing GPT managed area. The only
> problem is the older operating systems (like XP) which don't
> understand GPT wouldn't be able to access those partitions.

> Thanks.

The MBR in a GPT installation doesn't map the first GPT partition, it maps
the entire drive after the first sector, as well as marking it type 0xEE.
The start LBA of the file system is not correctly located in the MBR.

I will run some experiments to see if any of the systems on my desk can boot
Linux from a GPT.
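
For anyone who wants to look at what sector 0 of a GPT disk contains, here is a minimal sketch that builds and parses a protective MBR in memory. It relies on the classic MBR layout (four 16-byte entries at offset 446, type byte at entry offset 4, start LBA and sector count as little-endian 32-bit values, 0x55AA signature at offset 510). The function names and the CHS filler bytes are illustrative assumptions, and nothing here touches a real disk:

```python
import struct

PTABLE_OFFSET = 446    # partition table position inside sector 0
ENTRY_SIZE = 16
GPT_PROTECTIVE = 0xEE


def make_protective_mbr(total_sectors):
    """Build a 512-byte sector 0 with a single 0xEE entry that covers
    everything after the first sector (illustrative only)."""
    sector = bytearray(512)
    size = min(total_sectors - 1, 0xFFFFFFFF)
    # status, CHS start, type, CHS end, start LBA, sector count
    entry = struct.pack("<B3sB3sII", 0x00, b"\x00\x02\x00",
                        GPT_PROTECTIVE, b"\xff\xff\xff", 1, size)
    sector[PTABLE_OFFSET:PTABLE_OFFSET + ENTRY_SIZE] = entry
    sector[510:512] = b"\x55\xaa"
    return bytes(sector)


def mbr_entries(sector0):
    """Return (type, start_lba, sector_count) for each used entry."""
    if len(sector0) < 512 or sector0[510:512] != b"\x55\xaa":
        raise ValueError("missing 0x55AA boot signature")
    entries = []
    for i in range(4):
        off = PTABLE_OFFSET + i * ENTRY_SIZE
        ptype = sector0[off + 4]
        start, count = struct.unpack_from("<II", sector0, off + 8)
        if ptype != 0:
            entries.append((ptype, start, count))
    return entries


def is_protective_mbr(sector0):
    """True if the only entry is type 0xEE starting at LBA 1."""
    entries = mbr_entries(sector0)
    return (len(entries) == 1
            and entries[0][0] == GPT_PROTECTIVE
            and entries[0][1] == 1)


mbr = make_protective_mbr(976773168)
print(mbr_entries(mbr), is_protective_mbr(mbr))
```

Note that the boot code area (the first 446 bytes) is left zeroed here; on a bootable disk that is where the boot loader's first stage would live.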

Tejun Heo

unread,
Mar 9, 2010, 7:30:02 PM3/9/10
to
Hello,

On 03/10/2010 09:14 AM, Daniel Taylor wrote:
> The MBR in a GPT installation doesn't map the first GPT partition,
> it maps the entire drive after the first sector, as well as
> marking it type 0xEE.

Yeah, yeah, that was exactly what I was saying by "describing the rest
of the whole disk as a single chunk containing GPT managed area" with
a typo making "whole" "while".

> The start LBA of the file system is not correctly located in the
> MBR.

Sure it's not but MBR belongs to the boot loader not the BIOS. BIOS
just needs to load MBR and hand control to it. If the MBR or more
likely later stages of the bootloader loaded by MBR know how to boot
from GPT, it should work.

> I will run some experiments to see if any of the systems on my desk can boot
> Linux from a GPT.

I'm not sure about grub although I strongly suspect recent versions of
it should work but AFAICS lilo should definitely work as it doesn't
care how the disk is logically organized at all.

Thanks.

--
tejun

H. Peter Anvin

unread,
Mar 9, 2010, 7:40:01 PM3/9/10
to
On 03/09/2010 02:46 PM, Greg Freemyer wrote:
> <snip>
>>
>> As far as partitioning... I believe we should be using GPT partition tables
>> where possible. Even on non-EFI systems, it's simply a much better
>> partition table format.
>>
>> -hpa
>
> GPT can not be used for boot disks in non-EFI systems, right?
>

It can. The BIOS doesn't care about the partition table at all -- all
it does is load the MBR.

-hpa

H. Peter Anvin

unread,
Mar 9, 2010, 7:40:02 PM3/9/10
to
On 03/09/2010 04:26 PM, Tejun Heo wrote:
>
>> I will run some experiments to see if any of the systems on my desk can boot
>> Linux from a GPT.
>
> I'm not sure about grub although I strongly suspect recent version of
> it should work but AFAICS lilo should definitely work as it doesn't
> care how the disk is logically organized at all.
>

In the case of Syslinux, you have to install gptmbr.bin, but otherwise
it works unmodified (Syslinux itself doesn't care about the partition
table at all.)

Note: the official standard for GPT booting on BIOS is still evolving,
so I might change gptmbr to match the new standard.

-hpa

Tejun Heo

unread,
Mar 9, 2010, 7:40:02 PM3/9/10
to
Hello,

On 03/09/2010 03:50 AM, H. Peter Anvin wrote:
> Well, apparently Western Digital are looking at it for USB drives due to
> XP compatibility requirements -- those presumably are ATA internally and
> use a USB-ATA bridge.

This should work right now as long as the bridge chip doesn't screw
up, which we can't do much about anyway. USB is used as SCSI
transport and SCSI layer has been working fine with devices with
differing sector sizes for quite some time.

> On the flipside, though, there really is very little net benefit to 4K
> as opposed to 512 byte logical sectors: the additional protocol overhead
> is relatively minimal, and as long as writes are aligned full blocks,
> there shouldn't be any additional overhead on either the OS or the drive
> side. On the plus side, you get full compatibility with the existing
> software stack. The equation really seems rather simple.

Yeap, for addressing, whether 9 bit is shifted or 12 doesn't really
matter. That's only 8 times difference which may be breached in
probably under three years. If the current 48 bit addressing limit is
reached, we would be far better off introducing 64 or 128 bit
addressing. That was the reason why I thought that I would never see
an ATA disk w/ 4KiB logical sector and got pretty surprised that it
was being considered for XP compatibility where 3 year offset could be
pretty meaningful.
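
The 8x figure falls straight out of the arithmetic. A tiny sketch, with a made-up helper name:

```python
def max_capacity_bytes(lba_bits, sector_bytes):
    """Largest capacity addressable with lba_bits of LBA."""
    return (1 << lba_bits) * sector_bytes


PIB = 2**50

# 48-bit LBA with 512-byte logical sectors: 128 PiB.
print(max_capacity_bytes(48, 512) // PIB)
# 48-bit LBA with 4KiB logical sectors: 1024 PiB, i.e. 8 times more.
print(max_capacity_bytes(48, 4096) // PIB)
```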

Thanks.

--
tejun

Martin K. Petersen

unread,
Mar 10, 2010, 12:00:01 AM3/10/10
to
>>>>> "Michael" == Michael Tokarev <m...@tls.msk.ru> writes:

[MD I/O topology support]

Michael> But apparently it does not implement anything of this sort.
Michael> Adding Neilb to the Cc list.......

git show 8f6c2e4b

--
Martin K. Petersen Oracle Linux Engineering

Martin K. Petersen

unread,
Mar 10, 2010, 12:10:01 AM3/10/10
to
>>>>> "Karel" == Karel Zak <kz...@redhat.com> writes:

[Cleaned up the CC: list from hell]

Karel> # cat /sys/block/md8/queue/{minimum,optimal}_io_size
Karel> 65536 65536

This one had me puzzled. We set min_io and opt_io correctly in raid5.c
depending on number of non-parity disks. And yet it turns into
something nonsensical after.

Turns out we overrun unsigned int calculating the lowest common multiple
in the stacking function. That's why we ended up with the wrong value.

I never noticed this because my userland topology regression test tool
uses unsigned long.

I'll get a patch off to Jens right away.
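
For illustration, here is how a lowest common multiple can go wrong in 32-bit arithmetic. This is only a sketch that emulates unsigned int wraparound in Python; it is not the kernel's function and does not try to reproduce the exact 65536 value seen above. The inputs are the 64KiB chunk and the 192KiB stripe of the 4-disk RAID5 example:

```python
from math import gcd

U32_MASK = 0xFFFFFFFF


def lcm_u32_naive(a, b):
    """(a * b) / gcd(a, b) with the product truncated to 32 bits,
    as a plain unsigned int multiplication would do."""
    if a == 0 or b == 0:
        return a or b
    return ((a * b) & U32_MASK) // gcd(a, b)


def lcm_safe(a, b):
    """Divide before multiplying so the intermediate stays small."""
    if a == 0 or b == 0:
        return a or b
    return (a // gcd(a, b)) * b


chunk = 64 * 1024        # 65536
stripe = 3 * 64 * 1024   # 196608

# 65536 * 196608 = 3 * 2^32, which wraps to 0 in 32 bits.
print(lcm_u32_naive(chunk, stripe))
print(lcm_safe(chunk, stripe))
```

Using a wider type for the intermediate product, or dividing by the gcd first as in lcm_safe, both avoid the wraparound for values of this size.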

--
Martin K. Petersen Oracle Linux Engineering

H. Peter Anvin

unread,
Mar 10, 2010, 12:20:02 AM3/10/10
to
On 03/09/2010 04:14 PM, Daniel Taylor wrote:
>
> The MBR in a GPT installation doesn't map the first GPT partition, it maps
> the entire drive after the first sector, as well as marking it type 0xEE.
> The start LBA of the file system is not correctly located in the MBR.
>
> I will run some experiments to see if any of the systems on my desk can boot
> Linux from a GPT.

There is something called a "hybrid MBR", which is basically a GPT disk
with a single partition (the current bootable partition) mapped as an
MBR partition, instead of marking the whole disk 0xEE.

-hpa

Mark Lord

unread,
Mar 10, 2010, 1:10:02 AM3/10/10
to
On 03/09/10 19:00, Tejun Heo wrote:
> On 03/09/2010 10:55 PM, Mark Lord wrote:
>> On 03/07/10 22:48, Tejun Heo wrote:
>> ..
>>> Please note that hdparm is misreporting the alignment offset. It
>>> should be reporting 512 instead of 256 for offset-by-one drives.
>> ..
>>
>> That issue was fixed quite a while ago.
>> Upgrade your elderly copy of hdparm.
>
> Heh heh, *you* were keeping it from me! Anyways, is there hdparm
> devel tree published somewhere? I wandered the SF page for quite a
> bit (which for some reason is very difficult to find things in) but I
> couldn't find one. If it's not, it might be a good idea to put it on
> SF or git.kernel.org?
..

No tree. There's just my working copy (private),
and the published versions at SF.

But yes, SF has gotten incredibly more cryptic to use of late,
and I might have to move it somewhere more accessible soon.

Cheers!

Gabor Gombas

unread,
Mar 10, 2010, 2:10:02 AM3/10/10
to
On Tue, Mar 09, 2010 at 04:14:30PM -0800, Daniel Taylor wrote:

> I will run some experiments to see if any of the systems on my desk can boot
> Linux from a GPT.

My desktop with a BIOS from 2005 has no problems with GPT. AFAIK a
recent Debian installer automatically chooses GPT if the disk is 2 TB or
larger.

Gabor

Matthew Wilcox

unread,
Mar 10, 2010, 3:00:02 AM3/10/10
to
On Mon, Mar 08, 2010 at 10:41:57AM -0500, Martin K. Petersen wrote:
> What I meant to say was that I know ATA supports 4 KB LBS and that
> nobody appears to care about it.

I sent patches to add support ... they were ignored.

Part of the problem is that ATA is heinously broken wrt non-512 byte
sector sizes. You have to know which commands work in multiples of
the block size, and which commands work in multiples of 512-bytes.
There's no easy way to figure it out; you need a table.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Denys Vlasenko

unread,
Mar 10, 2010, 4:20:01 AM3/10/10
to
On Wed, Mar 10, 2010 at 12:46 AM, Arnd Bergmann <ar...@arndb.de> wrote:
> On Monday 08 March 2010 04:48:35 Tejun Heo wrote:
>> Unfortunately, while Windows can assume that newer releases won't
>> share the hard drive with older releases including Windows XP, Linux
>> distros can't do that. There will be many installations where a
>> modern Linux distros share a hard drive with older releases of
>> Windows. At this point, I can't see a silver bullet solution.
>>
>> Partitioners maybe should only align partitions which will be used by
>> Linux and default to the traditional layout for others while allowing
>> explicit override. I think Windows XP wouldn't have problem with
>> differently aligned partitions as long as it doesn't actually use them
>> but haven't tested it.
>
> Any idea if XP can cope with partition tables that use a 32-sector, 128-head
> geometry rather than the default 63-sector, 255-head one? That seems to
> be what some flash memory cards are using and it would make any cylinder
> aligned partition also 4096-byte aligned, at the cost of moving the
> 1024-cylinder boundary from 7.88 GiB to 2 GiB.
>
> Do we know of anything that requires 63s/255h?

63s/255h is more or less "standard" now.

Alignment issues can be solved by picking a good multiple of
_heads_ or _cylinders_:

For first partition, pick the start at 8th head:

cyl 0 head 1 sector 1: (LBA sector 63) - bad
cyl 0 head 8 sector 1: (LBA sector 8*63) - good (4k aligned)

For any other partition, pick start cylinder which is a multiple of 8:

cyl 8*x head 0 sector 1: LBA sector 8*x*255*63 - good (4k aligned)

This will actually work well for *any* geometry, not only for 63s/255h.
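
The arithmetic above can be checked with a short sketch. It assumes the
conventional CHS-to-LBA mapping (sectors numbered from 1) and 512-byte
logical sectors; the function names are made up for illustration only.

```python
def chs_to_lba(cyl, head, sector, heads=255, sectors=63):
    """Conventional CHS-to-LBA mapping; sector numbers start at 1."""
    return (cyl * heads + head) * sectors + (sector - 1)


def is_4k_aligned(lba, logical_sector_size=512):
    """True if the byte offset of this LBA is a multiple of 4096."""
    return (lba * logical_sector_size) % 4096 == 0


# Traditional first-partition start: cyl 0, head 1, sector 1
print(chs_to_lba(0, 1, 1), is_4k_aligned(chs_to_lba(0, 1, 1)))

# Start at the 8th head instead
print(chs_to_lba(0, 8, 1), is_4k_aligned(chs_to_lba(0, 8, 1)))

# Any later partition starting on a cylinder that is a multiple of 8
print(chs_to_lba(8, 0, 1), is_4k_aligned(chs_to_lba(8, 0, 1)))

# Same rule with a different geometry (32 sectors, 128 heads)
lba = chs_to_lba(16, 0, 1, heads=128, sectors=32)
print(lba, is_4k_aligned(lba))
```

Because the factor of 8 sits in the head or cylinder number, the resulting
LBA is a multiple of 8 sectors regardless of the sectors-per-track and
heads values.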
--
vda

Johannes Stezenbach

unread,
Mar 10, 2010, 5:50:02 AM3/10/10
to
On Tue, Mar 09, 2010 at 04:32:04PM -0800, H. Peter Anvin wrote:
>
> It can. The BIOS doesn't care about the partition table at all --
> all it does is load the MBR.

A little story for your entertainment pleasure:

I have a Gigabyte GA-MA78GM-S2H board, and during install
turned off the power after partitioning but before formatting
any partition because I got distracted by something else.

Result: System could not boot anymore, BIOS hung before
I could get to the "select boot device" screen. This also
happened when I removed the hdd from the boot device
list in BIOS. The last BIOS message was "Verifying DMI Pool Data"
and you can find numerous similar reports by searching for
'gigabyte bios hang "Verifying DMI Pool Data"'.

In my case it worked to switch the SATA mode from AHCI to
something else, then wipe the partition table and switch
back to AHCI. But I read on the net that some people had
to format the drive in another PC, or hotplug it after the BIOS
got past "Verifying DMI Pool Data".


Johannes

H. Peter Anvin

unread,
Mar 10, 2010, 6:30:02 AM3/10/10
to
On 03/10/2010 02:46 AM, Johannes Stezenbach wrote:
> On Tue, Mar 09, 2010 at 04:32:04PM -0800, H. Peter Anvin wrote:
>>
>> It can. The BIOS doesn't care about the partition table at all --
>> all it does is load the MBR.
>
> A little story for your entertainment pleasure:
>
> I have a Gigabyte GA-MA78GM-S2H board, and during install
> turned off the power after partitioning but before formatting
> any partition because I got distracted by something else.
>
> Result: System could not boot anymore, BIOS hung before
> I could get to the "select boot device" screen. This also
> happened when I removed the hdd from the boot device
> list in BIOS. The last BIOS message was "Verifying DMI Pool Data"
> and you can find numerous similar reports by searching for
> 'gigabyte bios hang "Verifying DMI Pool Data"'.
>
> In my case it worked to switch the SATA mode from AHCI to
> something else, then wipe the partition table and switch
> back to AHCI. But I read on the net that some people had
> to format the drive in another PC, or hotplug it after the BIOS
> got past "Verifying DMI Pool Data".
>

Well, yes, there are buggy BIOSes of a gazillion varieties. A fair
number of them read the partition table to try to guess what C/H/S
geometry the user intended. However, the GPT spec specifically uses a
"Protective MBR" to guard against this and other issues like it; it
makes the entire disk look to MBR-reading software like a single fully
partitioned disk with one large partition on it.

-hpa

Jeff Garzik

unread,
Mar 10, 2010, 8:50:01 AM3/10/10
to
On 03/10/2010 02:53 AM, Matthew Wilcox wrote:
> On Mon, Mar 08, 2010 at 10:41:57AM -0500, Martin K. Petersen wrote:
>> What I meant to say was that I know ATA supports 4 KB LBS and that
>> nobody appears to care about it.
>
> I sent patches to add support ... they were ignored.

Not true, read the rest of the thread.

Jeff

Damian Lukowski

unread,
Mar 10, 2010, 11:30:03 AM3/10/10
to
> S-2. The proper solution.
>
> Correct alignments for all partitions can't be achieved by the
> firmware alone. The system utilities should be informed about the
> alignment requirements and align partitions accordingly.
>
> The above firmware workaround complicates the situation because the
> two different configurations require different offsets to achieve
> the correct alignments. ATA/ATAPI-8 specifies a way for a drive to
> export the physical and logical sector sizes and the LBA offset
> which is aligned to the physical sectors.
>
> In Linux, these parameters are exported via the following sysfs
> nodes.
>
> physical sector size : /sys/block/sdX/queue/physical_block_size
> logical sector size : /sys/block/sdX/queue/logical_block_size
> alignment offset : /sys/block/sdX/alignment_offset
>
> Let the physical sector size be PSS, logical sector size LSS and
> alignment offset AOFF. The system software should place partitions
> such that the starting LBAs of all partitions are aligned on
>
> (n * PSS + AOFF) / LSS
>
> For 4 KiB physical sector offset-by-one drives, PSS is 4096, LSS 512
> and AOFF 3584 and with n of 7 the above becomes,
>
> (7 * 4096 + 3584) / 512 == 63
>
> making sector 63 an aligned LBA where the first partition can be
> put, but without the offset-by-one mapping, AOFF is zero and LBA 63
> is not aligned.
>
> With the above new alignment requirement in place, it becomes
> difficult to honor the legacy one - first partition on sector 63 and
> all other partitions on cylinder boundary (255 * 63 sectors) - as
> the two alignment requirements contradict each other. This might be
> worked around by adjusting how LBA and CHS addresses are mapped but
> the disk geometry parameters are hard coded everywhere and there is
> no reliable way to communicate custom geometry parameters.
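
For reference, the quoted formula can be exercised with a small sketch.
PSS, LSS and AOFF are the quantities defined in the quoted text; the
function names are hypothetical and only serve this illustration.

```python
def aligned_lba(n, pss, lss, aoff):
    """Return the n-th physically aligned LBA: (n * PSS + AOFF) / LSS."""
    byte_offset = n * pss + aoff
    if byte_offset % lss:
        raise ValueError("offset is not a whole number of logical sectors")
    return byte_offset // lss


def is_aligned(lba, pss, lss, aoff):
    """True if the LBA falls on a physical sector boundary."""
    return (lba * lss - aoff) % pss == 0


# Offset-by-one drive: PSS 4096, LSS 512, AOFF 3584
print(aligned_lba(7, 4096, 512, 3584))    # the worked example from the text
print(is_aligned(63, 4096, 512, 3584))    # sector 63 on an offset-by-one drive
print(is_aligned(63, 4096, 512, 0))       # sector 63 without the offset
print(is_aligned(2048, 4096, 512, 0))     # a 1 MiB start without the offset
```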

Hello,
I have practically no knowledge of Linux' block device drivers,
but is this really a partitioning issue? I think the problem is
with the filesystems when clustering multiple blocks without
knowledge of the sector alignment and sector size of the underlying
block device. Maybe it is a better solution to adapt the filesystem
buffer routine which reads/writes data from/to the block device?

Best regards
Damian

Henrique de Moraes Holschuh

unread,
Mar 10, 2010, 4:00:03 PM3/10/10
to
On Wed, 10 Mar 2010, Martin K. Petersen wrote:
> >>>>> "Karel" == Karel Zak <kz...@redhat.com> writes:
>
> [Cleaned up the CC: list from hell]
>
> Karel> # cat /sys/block/md8/queue/{minimum,optimal}_io_size
> Karel> 65536 65536
>
> This one had me puzzled. We set min_io and opt_io correctly in raid5.c
> depending on number of non-parity disks. And yet it turns into
> something nonsensical after.
>
> Turns out we overrun unsigned int calculating the lowest common multiple
> in the stacking function. That's why we ended up with the wrong value.
>
> I never noticed this because my userland topology regression test tool
> uses unsigned long.
>
> I'll get a patch off to Jens right away.

And please get the whole fixed deal in -stable eventually, for 2.6.32.y's
benefit :-)
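
The kind of overrun described in the quoted text can be illustrated with a
sketch. This is not the kernel's stacking code; it only assumes an LCM
computed as (a * b) / gcd(a, b) in 32-bit unsigned arithmetic, and the
input values are made up.

```python
from math import gcd

U32_MASK = 0xFFFFFFFF  # emulate wraparound of a 32-bit unsigned int


def lcm_u32_naive(a, b):
    """LCM as (a * b) / gcd(a, b), with the product truncated to 32 bits."""
    product = (a * b) & U32_MASK
    return product // gcd(a, b)


def lcm_divide_first(a, b):
    """Divide before multiplying, so the intermediate value stays small."""
    return (a // gcd(a, b)) * b


a, b = 196608, 98304  # hypothetical I/O sizes in bytes (192 KiB and 96 KiB)
print(lcm_u32_naive(a, b))     # wrong: the product wrapped around
print(lcm_divide_first(a, b))  # the true least common multiple
```

The true result fits comfortably in 32 bits; only the intermediate product
does not, which is why this sort of bug goes unnoticed when the same
calculation is done with a wider type.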

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

Theodore Tso

unread,
Mar 11, 2010, 8:20:03 AM3/11/10
to
On Mar 10, 2010, at 11:19 AM, Damian Lukowski wrote:
>
> I have practically no knowledge of Linux' block device drivers,
> but is this really a partitioning issue? I think the problem is
> with the filesystems when clustering multiple blocks without
> knowledge of the sector alignment and sector size of the underlying
> block device. Maybe it is a better solution to adapt the filesystem
> buffer routine which reads/writes data from/to the block device?

No, it's really a partitioning issue. If the paging subsystem wants a 4k block to fill a particular page, we need to read that 4k block into memory. If we need to swap out that 4k block, we need to write that 4k block to swap space, or to the memory segment's backing store. If the partition is misaligned by 512 bytes, this is simply not possible. The file system has to do what is requested of it by its users, and the reality is that we need to do 4k aligned reads and writes with respect to the beginning of the partition, far more often than not.

Hence, if we want the best performance on 4k sector drives, the partition needs to be aligned with respect to what is most desirable for the device in question.
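
A rough sketch of why the partition start matters: it counts how many
physical sectors a single 4k filesystem block touches, given where the
partition begins. The names and defaults are assumptions for illustration
(512-byte logical sectors, 4096-byte physical sectors).

```python
def physical_sectors_touched(part_start_lba, block_index,
                             block_size=4096, lss=512, pss=4096, aoff=0):
    """Count the physical sectors spanned by one filesystem block."""
    start = part_start_lba * lss + block_index * block_size  # first byte
    end = start + block_size - 1                             # last byte
    first = (start - aoff) // pss
    last = (end - aoff) // pss
    return last - first + 1


# Partition at sector 63 on a drive with no offset: every block straddles two
print(physical_sectors_touched(63, 0))

# Partition at sector 2048 (1 MiB): each block maps to one physical sector
print(physical_sectors_touched(2048, 0))

# Sector 63 on an offset-by-one drive (AOFF 3584) is aligned again
print(physical_sectors_touched(63, 0, aoff=3584))
```

When a 4k write covers two physical sectors only partially, the drive has
to read both, merge the new data and write both back, which is where the
performance loss comes from.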

Best regards,

-- Ted

Nikanth Karthikesan

unread,
Mar 11, 2010, 9:00:02 AM3/11/10
to
On Thursday 11 March 2010 18:34:56 Theodore Tso wrote:
> On Mar 10, 2010, at 11:19 AM, Damian Lukowski wrote:
> > I have practically no knowledge of Linux' block device drivers,
> > but is this really a partitioning issue? I think the problem is
> > with the filesystems when clustering multiple blocks without
> > knowledge of the sector alignment and sector size of the underlying
> > block device. Maybe it is a better solution to adapt the filesystem
> > buffer routine which reads/writes data from/to the block device?
>
> No, it's really a partitioning issue. If the paging subsystem wants a 4k
> block to fill a particular page, we need to read that 4k block into
> memory. If we need to swap out that 4k block, we need to write that 4k
> block to swap space, or to the memory segment's backing store. If the
> partition is misaligned by 512 bytes, this is simply not possible. The
> file system has to do what is requested of it by its users, and the
> reality is that we need to do 4k aligned reads and writes with respect to
> the beginning of the partition, far more often than not.
>
> Hence, if we want the best performance on 4k sector drives, the partition
> needs to be aligned with respect to what is most desirable for the device
> in question.
>

I guess what he meant was to keep filesystem blocks aligned even if the
partition is not. Say the partition is misaligned by 512 bytes: let the
filesystem waste 4k - 512 bytes and keep its blocks aligned. But it might be a
case of over-engineering, possibly requiring a disk format change.

Thanks
Nikanth

Theodore Tso

unread,
Mar 11, 2010, 9:40:02 AM3/11/10
to

On Mar 11, 2010, at 8:57 AM, Nikanth Karthikesan wrote:
>
> I guess, what he meant was, to keep filesystem blocks aligned, even if the
> partition is not. Say if the partition is mis-aligned by 512-bytes, let the
> filesystem waste 4k-512bytes and keep its blocks aligned. But it might be a
> case of over-engineering, possibly requiring disk format change.

Ah, yes, I agree with you; that's probably what he meant.

Sure, that's theoretically possible, but it would mean changing every single filesystem, and it would require a file system format change --- or at least a file system format extension.

It would seem to be way easier to simply fix the partitioning tools to do the right thing, though.

-- Ted

Mike Snitzer

unread,
Mar 11, 2010, 9:50:02 AM3/11/10
to
On Thu, Mar 11, 2010 at 9:28 AM, Theodore Tso <ty...@mit.edu> wrote:
>
> On Mar 11, 2010, at 8:57 AM, Nikanth Karthikesan wrote:
>>
>> I guess, what he meant was, to keep filesystem blocks aligned, even if the
>> partition is not. Say if the partition is mis-aligned by 512-bytes, let the
>> filesystem waste 4k-512bytes and keep its blocks aligned. But it might be a
>> case of over-engineering, possibly requiring disk format change.
>
> Ah, yes, I agree with you; that's probably what he meant.
>
> Sure, that's theoretically possible, but it would mean changing every single filesystem, and it would require a file system format change --- or at least a file system format extension.
>
> It would seem to be way easier to simply fix the partitioning tools to do the right thing, though.

Yes, the current supported approach is to rely on partitions (parted,
fdisk) or LVM to account for 'alignment_offset'.

This avoids having a filesystem add its own padding (format change).
But e2fsprogs at least warns if a device, that it is to format, has an
alignment_offset != 0.

Mike

James Bottomley

unread,
Mar 11, 2010, 9:50:02 AM3/11/10
to
On Thu, 2010-03-11 at 09:28 -0500, Theodore Tso wrote:
> On Mar 11, 2010, at 8:57 AM, Nikanth Karthikesan wrote:
> >
> > I guess, what he meant was, to keep filesystem blocks aligned, even if the
> > partition is not. Say if the partition is mis-aligned by 512-bytes, let the
> > filesystem waste 4k-512bytes and keep its blocks aligned. But it might be a
> > case of over-engineering, possibly requiring disk format change.
>
> Ah, yes, I agree with you; that's probably what he meant.
>
> Sure, that's theoretically possible, but it would mean changing every
> single filesystem, and it would require a file system format change
> --- or at least a file system format extension.
>
> It would seem to be way easier to simply fix the partitioning tools to
> do the right thing, though.

Actually, it's a layering violation. The filesystem shouldn't need to
probe the device layout ... particularly when there are complexities
like is it logical 512 or physical, and if logical 512 on 4k does it
have an offset exponent or not.

We can transmit certain abstractions of information up the stack (like
stripe width for RAID arrays which should be the fs optimal write size),
but for this type of alignment, which can be completely solved at the
partition layer, the information should really stay there and the
filesystem should "just work".

James

Nikanth Karthikesan

unread,
Mar 11, 2010, 10:00:01 AM3/11/10
to
On Thursday 11 March 2010 19:58:11 Theodore Tso wrote:
> On Mar 11, 2010, at 8:57 AM, Nikanth Karthikesan wrote:
> > I guess, what he meant was, to keep filesystem blocks aligned, even if
> > the partition is not. Say if the partition is mis-aligned by 512-bytes,
> > let the filesystem waste 4k-512bytes and keep its blocks aligned. But it
> > might be a case of over-engineering, possibly requiring disk format
> > change.
>
> Ah, yes, I agree with you; that's probably what he meant.
>
> Sure, that's theoretically possible, but it would mean changing every
> single filesystem, and it would require a file system format change --- or
> at least a file system format extension.
>
> It would seem to be way easier to simply fix the partitioning tools to do
> the right thing, though.
>

Yes. Maybe just a simple but transparent device-mapper-like mapping on top
of the misaligned partition, to do the alignment. Then the filesystem code
need not change much.

But Linux already has device-mapper, and Linux will not be affected by
misaligned partitions when we use LVM.

But the actual problem here is that partitioning tools might create partitions
that won't allow other operating systems to boot. So it might be enough if the
partitioning tools just create partitions with the (mis-)alignment required by
Windows.

Thanks
Nikanth

Nikanth Karthikesan

unread,
Mar 11, 2010, 10:10:01 AM3/11/10
to
On Thursday 11 March 2010 20:09:34 James Bottomley wrote:
> On Thu, 2010-03-11 at 09:28 -0500, Theodore Tso wrote:
> > On Mar 11, 2010, at 8:57 AM, Nikanth Karthikesan wrote:
> > > I guess, what he meant was, to keep filesystem blocks aligned, even if
> > > the partition is not. Say if the partition is mis-aligned by 512-bytes,
> > > let the filesystem waste 4k-512bytes and keep its blocks aligned. But
> > > it might be a case of over-engineering, possibly requiring disk format
> > > change.
> >
> > Ah, yes, I agree with you; that's probably what he meant.
> >
> > Sure, that's theoretically possible, but it would mean changing every
> > single filesystem, and it would require a file system format change
> > --- or at least a file system format extension.
> >
> > It would seem to be way easier to simply fix the partitioning tools to
> > do the right thing, though.
>
> Actually, it's a layering violation. The filesystem shouldn't need to
> probe the device layout ... particularly when there are complexities
> like is it logical 512 or physical, and if logical 512 on 4k does it
> have an offset exponent or not.
>
> We can transmit certain abstractions of information up the stack (like
> stripe width for RAID arrays which should be the fs optimal write size),
> but for this type of alignment, which can be completely solved at the
> partition layer, the information should really stay there and the
> filesystem should "just work".
>

Right. It would be layering violation and we have LVM to solve it already.

The real problem here is just that partitioning tools should create
partitions that can work with both XP and Windows 7. Maybe distro
installers should ask the user which compatibility he needs.

Thanks
Nikanth

Tejun Heo

unread,
Mar 11, 2010, 10:10:01 AM3/11/10
to
Hello,

On 03/12/2010 12:00 AM, Nikanth Karthikesan wrote:
> But the actual problem here is that partitioning tools might create
> partitions that won't allow other operating-systems to boot. So it
> might be enough, if the partitioning tools just create partitions
> with (mis-)alignment requirement for Windows.

Turns out XP is generally OK. The reported problem was only on
specific configurations (some BIOS stuff). Windows 2000 reportedly
would be hurt but I really think we don't have to care about that too
much. So, it seems like we wouldn't have to worry too much about it
and just go ahead with new alignment schemes. I'll update the doc
this weekend with new information from this now rather large thread.

Thanks.

--
tejun

ty...@mit.edu

unread,
Mar 11, 2010, 10:30:02 AM3/11/10
to
On Thu, Mar 11, 2010 at 08:35:26PM +0530, Nikanth Karthikesan wrote:
> The real problem here is just that partitioning tools should create
> partitions that can work with both XP and Windows 7. Maybe distro
> installers should ask the user which compatibility he needs.

4k aligned sectors will *work* with Windows XP, will it not? It's
just simply a matter of Windows XP, being really ancient, doesn't
create properly aligned partitions by default.

And how often are we going to see Windows XP systems with these new 4k
physical sector drives anyway, where the first OS to touch the
partition is Windows XP? And in the case where this does happen, the
resulting partition will result in terrible performance for Windows
XP as well as Linux.

What's the specific scenario which you are trying to solve, and how
likely is it to occur in real life?

- Ted

Mike Snitzer

unread,
Mar 11, 2010, 11:10:01 AM3/11/10
to
On Thu, Mar 11, 2010 at 10:00 AM, Nikanth Karthikesan <knik...@suse.de> wrote:
> On Thursday 11 March 2010 19:58:11 Theodore Tso wrote:
>> On Mar 11, 2010, at 8:57 AM, Nikanth Karthikesan wrote:
>> > I guess, what he meant was, to keep filesystem blocks aligned, even if
>> > the partition is not. Say if the partition is mis-aligned by 512-bytes,
>> > let the filesystem waste 4k-512bytes and keep its blocks aligned. But it
>> > might be a case of over-engineering, possibly requiring disk format
>> > change.
>>
>> Ah, yes, I agree with you; that's probably what he meant.
>>
>> Sure, that's theoretically possible, but it would mean changing every
>>  single filesystem, and it would require a file system format change --- or
>>  at least a file system format extension.
>>
>> It would seem to be way easier to simply fix the partitioning tools to do
>>  the right thing, though.
>>
>
> Yes. Maybe just a simple but transparent device-mapper-like mapping on top
> of the misaligned partition, to do the alignment. Then the filesystem code
> need not change much.
>
> But Linux already has device-mapper, and Linux will not be affected by
> misaligned partitions when we use LVM.

Well, device-mapper and LVM needed to be updated to make them "just
work" but yes that work has been done.

> But the actual problem here is that partitioning tools might create partitions
> that won't allow other operating-systems to boot. So it might be enough, if the
> partitioning tools just create partitions with (mis-)alignment requirement for
> Windows.

I'm not following...

Anyway, 4K drives that are 512b logical and 4K physical may or may not
also have "DOS partition compensation" that uses LBA -1 as the first
naturally (4K) aligned start. This means that the partition tools
need to shift the start of the first primary partition to be offset by
3584 bytes (7 512b sectors) for use with Linux. But for Windows,
AFAIK Windows XP and Windows 7 create all partitions aligned on 1MB
boundaries. Linux's parted and fdisk create 1MB aligned partitions
now too.

So the only outliers are older versions of Windows (< XP) and Linux (old
fdisk and parted, etc. also use DOS partitioning) that don't use
naturally aligned (e.g. 1MB) partition boundaries. In those versions
of Windows and Linux there are ways to change the default start of
sector 63. That said, there is an opportunity to improve
documentation for how to work around DOS partitioning on these
operating systems.

One other piece worth mentioning on this "IO Topology" support in the
entire Linux I/O stack is the virt layers. hch has already extended
the virt-io protocol and qemu is in the finishing stages of being
updated to properly consume the "IO Topology" information. So we
really don't have any gaps in the Linux I/O stack.

mkp in particular, Jens, James, myself, and others implemented and
refined the SCSI and block changes. kzak, jim meyering, hans de
goede, hch, eric sandeen, bob peterson, myself and others updated all
other I/O stack layers ranging from DM to LVM, libblkid, fdisk, parted
to anaconda to mkfs.ext[234], mkfs.xfs, mkfs.gfs2 to virt-io and qemu.
FYI, all of these advances will be in Fedora 13 (quite a few are
already in Fedora 12).

There are obviously other Linux systems and userland tools (likely
Xen, other mkfs.* and more) that should be updated. Hopefully
maintainers and/or contributors of these projects will follow-up to
address those that need updating.

Again please see:
http://oss.oracle.com/~mkp/docs/linux-advanced-storage.pdf
http://people.redhat.com/msnitzer/docs/io-limits.txt
Some omissions include: Linux MD, which has been updated as mkp
pointed out, and I neglected to talk about virt-io and qemu (but like
I said they have been updated too).

Hopefully we're all closer to being on the same page now.

Mike

Gene Heskett

unread,
Mar 11, 2010, 11:30:02 AM3/11/10
to
On Thursday 11 March 2010, ty...@mit.edu wrote:
>On Thu, Mar 11, 2010 at 08:35:26PM +0530, Nikanth Karthikesan wrote:
>> The real problem here is just that partitioning tools should create
>> partitions that can work with both XP and Windows 7. Maybe distro
>> installers should ask the user which compatibility he needs.
>
>4k aligned sectors will *work* with Windows XP, will it not? It's
>just simply a matter of Windows XP, being really ancient, doesn't
>create properly aligned partitions by default.
>
>And how often are we going to see Windows XP systems with these new 4k
>physical sector drives anyway, where the first OS to touch the
>partition is Windows XP? And in the case where this does happen, the
>resulting partition will result in terrible performance for Windows
>XP as well as Linux.
>
>What's the specific scenario which you are trying to solve, and how
>likely is it to occur in real life?

And potentially one more question from a list lurker, Ted. Where are the
tools that allow us to check and/or adjust that? I ask since I have 3 of
these terabyte drives in this box now and have no clue how to either check
whether we're off, or how to fix it if we are. And I have been
following this discussion without noticing whether the tools have been
specifically named.
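
One rough way to check, sketched here under assumptions: it reads the
sysfs nodes mentioned earlier in the thread (physical_block_size and
alignment_offset) plus each partition's `start` file, which is taken to be
in 512-byte units. The helper names are hypothetical, and this is no
substitute for the updated fdisk/parted.

```python
from pathlib import Path


def read_int(path):
    """Read a single integer from a sysfs-style file."""
    return int(Path(path).read_text().strip())


def check_disk(disk, sysfs="/sys/block"):
    """Map each partition of `disk` to True if its start is aligned."""
    base = Path(sysfs) / disk
    pss = read_int(base / "queue" / "physical_block_size")
    aoff = read_int(base / "alignment_offset")
    results = {}
    for part in sorted(base.glob(disk + "*")):
        start_file = part / "start"
        if not start_file.is_file():
            continue  # not a partition directory
        start_bytes = read_int(start_file) * 512  # assumed 512-byte units
        results[part.name] = (start_bytes - aoff) % pss == 0
    return results
```

Calling `check_disk("sda")` on a machine with such a drive would return a
dictionary of partition names and whether each one starts on a physical
sector boundary; a drive that reports 512-byte physical sectors will show
every partition as aligned.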

Thanks

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)

Authors are easy to get on with -- if you're fond of children.
-- Michael Joseph, "Observer"

H. Peter Anvin

unread,
Mar 11, 2010, 11:40:03 AM3/11/10
to
On 03/11/2010 05:57 AM, Nikanth Karthikesan wrote:
>
> I guess, what he meant was, to keep filesystem blocks aligned, even if the
> partition is not. Say if the partition is mis-aligned by 512-bytes, let the
> filesystem waste 4k-512bytes and keep its blocks aligned. But it might be a
> case of over-engineering, possibly requiring disk format change.
>

That's basically what you end up having to do for FAT filesystems to be
aligned.

-hpa

Greg Freemyer

unread,
Mar 11, 2010, 11:40:03 AM3/11/10
to
On Thu, Mar 11, 2010 at 10:25 AM, <ty...@mit.edu> wrote:
> On Thu, Mar 11, 2010 at 08:35:26PM +0530, Nikanth Karthikesan wrote:
>> The real problem here is just that partitioning tools should create
>> partitions that can work with both XP and Windows 7. Maybe distro
>> installers should ask the user which compatibility he needs.
>
> 4k aligned sectors will *work* with Windows XP, will it not? It's
> just simply a matter of Windows XP, being really ancient, doesn't
> create properly aligned partitions by default.
>
> And how often are we going to see Windows XP systems with these new 4k
> physical sector drives anyway, where the first OS to touch the
> partition is Windows XP? And in the case where this does happen, the
> resulting partition will result in terrible performance for Windows
> XP as well as Linux.
>
> What's the specific scenario which you are trying to solve, and how
> likely is it to occur in real life?
>
>                                        - Ted

Ted,

Apparently the real issue is Win2K, not XP.

It seems to require the boot partition and possibly all partitions
start on a cylinder boundary. And may have 255/63 hard-coded in to
define what a cylinder is. I agree with the apparent consensus that a
2010 era linux partitioner does not need to be Win2K compatible. If
someone wants to install Win2K they will need to either use an older
generation partitioner to create the partitions or use specific
command-line args to force a non-optimal alignment.

I do think the linux partitioners should provide a way to force a
cylinder alignment. Tejun, I would like to see your doc describe how
to force a win2k compatible partition layout.

fyi: The same issue apparently also exists for users still running OS/2.

Greg

Christoph Hellwig

unread,
Mar 11, 2010, 1:30:03 PM3/11/10
to
On Thu, Mar 11, 2010 at 11:01:41AM -0500, Mike Snitzer wrote:
> mkp in particular, Jens, James, myself, and others implemented and
> refined the SCSI and block changes. kzak, jim meyering, hans de
> goede, hch, eric sandeen, bob peterson, myself and others updated all
> other I/O stack layers ranging from DM to LVM, libblkid, fdisk, parted
> to anaconda to mkfs.ext[234], mkfs.xfs, mkfs.gfs2 to virt-io and qemu.
> FYI, all of these advances will be in Fedora 13 (quite a few are
> already in Fedora 12).

I also have some older patches for btrfs that I need to get back out
to the list. There was some talk of major changes to the organization
of the tools so I held it back for a while longer.

Tejun Heo

unread,
Mar 11, 2010, 8:20:01 PM3/11/10
to
Hello,

On 03/12/2010 01:34 AM, Greg Freemyer wrote:
> I do think the linux partitioners should provide a way to force a
> cylinder alignment. Tejun, I would like to see your doc describe how
> to force a win2k compatible partition layout.

I suppose I can play with fdisk and list it as an example but if
anyone knows better/proper way to force certain partitions to legacy
alignment while leaving others properly aligned, I'll be happy to
include it.

Thanks.

--
tejun

H. Peter Anvin

unread,
Mar 11, 2010, 10:20:01 PM3/11/10
to
I think if you use the DOS compat option to create the legacy partitions only, you should be fine.

"Tejun Heo" <t...@kernel.org> wrote:

>Hello,
>
>On 03/12/2010 01:34 AM, Greg Freemyer wrote:
>> I do think the linux partitioners should provide a way to force a
>> cylinder alignment. Tejun, I would like to see your doc describe how
>> to force a win2k compatible partition layout.
>
>I suppose I can play with fdisk and list it as an example but if
>anyone knows better/proper way to force certain partitions to legacy
>alignment while leaving others properly aligned, I'll be happy to
>include it.
>
>Thanks.
>
>--
>tejun

--
Sent from my mobile phone, pardon any lack of formatting.

Michal Soltys

unread,
Mar 14, 2010, 5:40:02 PM3/14/10
to
Tejun Heo wrote:
>
> I was thinking about testing XP booting this weekend but really want
> to avoid it, so thanks a lot for the info. I'll update the doc
> accordingly but can you please enlighten me on how it works and what's
> broken in detail? So, XP should be fine with any alignment?
>
> Thanks.
>

Sorry for late reply.

s/sp2/sp3 - although it shouldn't make a difference from sp2 onwards.

Anyway - the tests I did were because of a weird laptop, where I shrank
the whole win7 stuff and, having no primary partitions left to use, tested
my usual Windows XP installation that I deploy with ntfsclone. Originally
that XP was installed from an installation disk merged with sp3 (or, as
it's usually called in the Windows world, slipstreamed). Of course,
Windows XP itself will not present any option to install itself into a
logical partition in the usual way - but during later deployment it's not
a problem to put it where one wants.

It's possible that this wouldn't work, if windows were installed first
from pre-sp2 media, and then service pack was installed (in such case,
ntldr in C:\ is not updated afaik). It's also possible, that "brute-force"
copied pre-sp2 or win2k to a partition made with either - a) xp sp2+'s disk
manager or b) mkfs.ntfs and with updated most recent ntldr - would boot as
well (the partition requirement is due to potential differences between the code
in bootsector, or more precisely - $Boot - first 8KiB of ntfs partition).

Obvious requirements besides the above (ntldr, perhaps $Boot as well) are:

- mentioned "hidden sectors" (must be manually adjusted, recent syslinux's
chain.c32 has option to do it automatically)
- adjusted boot.ini (to point to new partition, eventually other windowish
stuff as necessary)

As you can see, there're many "if"s and combinations here that I didn't test.

On a related note - ironically, while I had 0 problems making it work
through syslinux (both regular chaining and through direct ntldr loading) -
I couldn't make win7's bootmgr (bcd, bcdedit ....) do it properly. Oh well.

s ponnusa

unread,
Mar 14, 2010, 7:00:02 PM3/14/10
to
Have been following this thread and I might possibly be testing with
Windows XP soon. Will update the results.
-
SP

H. Peter Anvin

unread,
Mar 14, 2010, 9:30:02 PM3/14/10
to
On 03/10/2010 01:14 AM, Denys Vlasenko wrote:
>
> 63s/255h is more or less "standard" now.
>
> Alignment issues can be solved by picking a good multiple of
> _heads_ or _cylinders_:
>
> For first partition, pick the start at 8th head:
>
> cyl 0 head 1 sector 1: (LBA sector 63) - bad
> cyl 0 head 8 sector 1: (LBA sector 8*63) - good (4k aligned)
>
> For any other partition, pick start cylinder which is a multiple of 8:
>
> cyl 8*x head 0 sector 1: LBA sector 8*x*255*63 - good (4k aligned)
>
> This will actually work well for *any* geometry, not only for 63s/255h.

Yes, but it does squat for a flash disk that wants, say, 256K alignment.
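
To make the arithmetic concrete, here is a small Python sketch (not from
the thread) that converts the CHS positions quoted above to LBAs and checks
them against both 4 KiB and 256 KiB alignment. It assumes 512-byte logical
sectors and the 255-head, 63-sector geometry under discussion; the function
names are made up for this example.

```python
SECTOR_BYTES = 512  # logical sector size assumed throughout


def chs_to_lba(cyl, head, sector, heads=255, sectors_per_track=63):
    """Standard CHS to LBA conversion (sector numbers start at 1)."""
    return (cyl * heads + head) * sectors_per_track + (sector - 1)


def is_aligned(lba, alignment_bytes):
    """True if the byte offset of this LBA is a multiple of alignment_bytes."""
    return (lba * SECTOR_BYTES) % alignment_bytes == 0


for chs in [(0, 1, 1), (0, 8, 1), (8, 0, 1), (512, 0, 1)]:
    lba = chs_to_lba(*chs)
    print(chs, lba,
          "4K:", is_aligned(lba, 4096),
          "256K:", is_aligned(lba, 256 * 1024))
```

With this geometry, head 8 and cylinder 8 both land on a 4 KiB boundary but
not on a 256 KiB one; because 255*63 is odd, a cylinder boundary only
reaches 256 KiB alignment at multiples of 512 cylinders.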

-hpa

Denys Vlasenko

Mar 14, 2010, 10:30:02 PM
On Monday 15 March 2010 02:21, H. Peter Anvin wrote:
> On 03/10/2010 01:14 AM, Denys Vlasenko wrote:
> >
> > 63s/255h is more or less "standard" now.
> >
> > Alignment issues can be solved by picking a good multiple of
> > _heads_ or _cylinders_:
> >
> > For first partition, pick the start at 8th head:
> >
> > cyl 0 head 1 sector 1: LBA sector 63) - bad
> > cyl 0 head 8 sector 1: LBA sector 8*63) - good (4k aligned)
> >
> > For any other partition, pick start cylinder which is a multiple of 8:
> >
> > cyl 8*x head 0 sector 1: LBA sector 8*x*255*63 - good (4k aligned)
> >
> > This will actually work well for *any* geometry, not only for 63s/255h.
>
> Yes, but it does squat for a flash disk that wants, say, 256K alignment.

4K makes sense. 256K not so much.

256K alignment is hard to swallow for a lot of reasons anyway.
Unless the filesystem packs small files into blocks a-la reiserfs,
256K block filesystems will be very inefficient for typical
storage scenarios.

It looks like flash storage manufacturers just have to bite
the bullet and develop smarter algorithms that combine wear
leveling, block remapping and such and make their internal
preference for huge continuous aligned writes nearly invisible
from the outside - just like hard disks which do not expose
their zoned recording, variable sector counts etc.

Such algorithms aren't trivial, but they are possible.
Whoever incorporates them in their products
will deliver a significantly better user experience.

I just played with an Ubuntu installation on a USB stick.
Yes, it works. Sort of. Write performance is abysmal.
I would pay 2x or 3x for the same sized stick if it
performed better.

--
vda

Greg Freemyer

Mar 14, 2010, 11:00:01 PM
> I just played with ubuntu installation on an usb stick.
> Yes, it works. Soft of. Write performance is abysmal.
> I would pay x2 or x3 for the same sized stick if it
> would perform better.

In general USB sticks don't offer the same performance as SSDs.

You can find sticks with both USB and eSata. I'd hope those offer
better performance.

You should read some performance reviews. I'm sure you can find a few
sticks that are much better than what you get from a vanilla usb
stick.

Greg

H. Peter Anvin

Mar 15, 2010, 12:10:01 AM
On 03/14/2010 07:26 PM, Denys Vlasenko wrote:
>>
>> Yes, but it does squat for a flash disk that wants, say, 256K alignment.
>
> 4K makes sense. 256K not so much.
>
> 256K alignment is hard to swallow for a lot of reasons anyway.
> Unless the filesystem packs small files into blocks a-la reiserfs,
> 256K block filesystems will be very inefficient for a typical
> storage scenarios.
>

No one has talked about using 256K filesystem blocks. The fact of the
matter, though, is that both flash and RAID have much larger alignment
requirements than a mere 4K for optimal performance.

You might not like it, but that's the way it is.

-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

da...@lang.hm

Mar 15, 2010, 1:30:02 AM
On Mon, 15 Mar 2010, Denys Vlasenko wrote:

> On Monday 15 March 2010 02:21, H. Peter Anvin wrote:
>> On 03/10/2010 01:14 AM, Denys Vlasenko wrote:
>>>
>>> 63s/255h is more or less "standard" now.
>>>
>>> Alignment issues can be solved by picking a good multiple of
>>> _heads_ or _cylinders_:
>>>
>>> For first partition, pick the start at 8th head:
>>>
>>> cyl 0 head 1 sector 1: LBA sector 63) - bad
>>> cyl 0 head 8 sector 1: LBA sector 8*63) - good (4k aligned)
>>>
>>> For any other partition, pick start cylinder which is a multiple of 8:
>>>
>>> cyl 8*x head 0 sector 1: LBA sector 8*x*255*63 - good (4k aligned)
>>>
>>> This will actually work well for *any* geometry, not only for 63s/255h.
>>
>> Yes, but it does squat for a flash disk that wants, say, 256K alignment.
>
> 4K makes sense. 256K not so much.
>
> 256K alignment is hard to swallow for a lot of reasons anyway.
> Unless the filesystem packs small files into blocks a-la reiserfs,
> 256K block filesystems will be very inefficient for a typical
> storage scenarios.

the thing is, if the OS can learn that it's more efficient to write in
256K aligned chunks, then it can batch up things so that the drive doesn't
have to do a read-modify-write cycle and can instead just replace the
entire chunk.

raid arrays can benefit from this as well as SSDs.

the OS can do this when writing things to swap, flushing dirty buffers,
mmaped files, etc (in fact, if the OS knows the full contents of the
chunk, it may be more efficient for the OS to write the entire thing than
to write part of it and have the drive/array do the read-modify-write
cycle)
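
As a rough illustration of the point (a sketch, not how any particular
kernel or drive does it), the following Python function splits a write into
the chunks it covers completely and the chunks it only touches partially.
Only the partial ones would force a read-modify-write on a device with that
chunk size. The 256 KiB default and the function name are assumptions made
for the example.

```python
def split_write(offset, length, chunk=256 * 1024):
    """Return (full, partial): indices of chunks fully / partially covered."""
    if length <= 0:
        return [], []
    end = offset + length            # first byte past the write
    first = offset // chunk          # first chunk touched
    last = (end - 1) // chunk        # last chunk touched
    full, partial = [], []
    for idx in range(first, last + 1):
        chunk_start = idx * chunk
        if offset <= chunk_start and end >= chunk_start + chunk:
            full.append(idx)         # whole chunk replaced, no read needed
        else:
            partial.append(idx)      # device must read, merge, then write
    return full, partial


print(split_write(0, 256 * 1024))            # aligned, exactly one chunk
print(split_write(4096, 256 * 1024))         # same size, shifted by 4 KiB
print(split_write(128 * 1024, 512 * 1024))   # misaligned at both ends
```

The second call shows why alignment matters even when the write is exactly
one chunk long: shifted by 4 KiB, it covers no chunk completely and touches
two partially.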

David Lang

> It looks like flash storage manufacturers just have to bite
> the bullet and develop smarter algorithms that combine wear
> leveling, block remapping and such and make their internal
> preference for huge continuous aligned writes nearly invisible
> from the outside - just like hard disks which do not expose
> their zoned recording, variable sector counts etc.
>
> Such algorithms aren't trivial, but they are possible.
> Whoever will incorporate them in their products,
> delivers a significantly better user experience.
>
> I just played with ubuntu installation on an usb stick.
> Yes, it works. Soft of. Write performance is abysmal.
> I would pay x2 or x3 for the same sized stick if it
> would perform better.
>

Denys Vlasenko

Mar 15, 2010, 6:00:02 AM
On Monday 15 March 2010 06:20, da...@lang.hm wrote:
> >>> For any other partition, pick start cylinder which is a multiple of 8:
> >>>
> >>> cyl 8*x head 0 sector 1: LBA sector 8*x*255*63 - good (4k aligned)
> >>>
> >>> This will actually work well for *any* geometry, not only for 63s/255h.
> >>
> >> Yes, but it does squat for a flash disk that wants, say, 256K alignment.
> >
> > 4K makes sense. 256K not so much.
> >
> > 256K alignment is hard to swallow for a lot of reasons anyway.
> > Unless the filesystem packs small files into blocks a-la reiserfs,
> > 256K block filesystems will be very inefficient for a typical
> > storage scenarios.
>
> the thing is, if the OS can learn that it's more efficiant to write in
> 256K aligned chunks, then it can batch up things so that the drive doesn't
> have to do a read-modify-write cycle and can instead just replace the
> entire chunk.

I think Linux is already doing this. The problem is, in many cases
the OS can't possibly do this, short of using a specially designed
filesystem.

If you untar a Linux kernel source tarball on a seriously
fragmented ext2 filesystem, there will be a lot of discontiguous
and/or misaligned writes smaller than 256K.
Only smart firmware can help in this case.
--
vda

Arnd Bergmann

Mar 15, 2010, 8:40:02 AM
On Monday 15 March 2010, H. Peter Anvin wrote:
> > 256K alignment is hard to swallow for a lot of reasons anyway.
> > Unless the filesystem packs small files into blocks a-la reiserfs,
> > 256K block filesystems will be very inefficient for a typical
> > storage scenarios.
>
> Noone has talked about using 256K filesystem blocks.

Well, logfs has just been merged and works with block sizes in that
range, but obviously only if the partition is correctly aligned.

Arnd

H. Peter Anvin

Mar 15, 2010, 10:50:01 AM
On 03/15/2010 02:56 AM, Denys Vlasenko wrote:
> I think Linux already is doing this. The problem is, in many cases
> OS can't possibly do this, short of using a specially designed
> filesystem.
>
> If you untar a Linux kernel source tarball on a seriously
> fragmented ext2 filesystem, there will be a lot of discontiguous
> and/or misaligned writes smaller than 256K.
> Only smart firmware can help in this case.

Yes, but guess what... there is a lot of stupid firmware out there, and
there are lots of RAID arrays, and so on.

"Seriously fragmented" means you have already lost in the first place.

This doesn't change the fact that this is a real issue and that that is
the major reason why aligning to 63*4K is a bad idea.
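
One way to see the problem, sketched in Python under the assumption that
devices want power-of-two alignment: the largest power of two that divides
a byte offset is the best alignment that offset can offer. The helper name
is made up for this example.

```python
def natural_alignment(offset_bytes):
    """Largest power of two that divides offset_bytes (offset must be > 0)."""
    if offset_bytes <= 0:
        raise ValueError("offset must be positive")
    # In two's complement, x & -x isolates the lowest set bit of x.
    return offset_bytes & -offset_bytes


print(natural_alignment(63 * 4096))      # a 63*4K boundary
print(natural_alignment(1024 * 1024))    # a 1 MiB boundary
```

Because 63 is odd, an offset of 63*4K is aligned to 4 KiB and nothing
larger, whereas a 1 MiB boundary also satisfies 256K and every smaller
power-of-two requirement.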

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

Tejun Heo

Mar 15, 2010, 10:40:02 PM
Aieee... critical typo.

On 03/16/2010 11:30 AM, Tejun Heo wrote:
> partition table which some BIOSs actually try to do. The problem is
> that to determine the two parameters you need to equations matching
^^
two
> CHSs and LBAs
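
A possible way to read "two equations" is sketched below in Python (this is
an illustration, not code from any BIOS or partitioner). Each partition
table entry that records both a CHS and an LBA gives one equation
LBA - (S - 1) = C*(heads*spt) + H*spt, which is linear in the two unknowns
heads*spt and spt, so two independent entries are enough to solve for the
geometry. The function name and tuple layout are assumptions.

```python
def solve_geometry(p1, p2):
    """Solve for (heads, sectors_per_track) from two (C, H, S, LBA) tuples.

    Returns None when the two points are not independent or when no
    integer geometry fits them.
    """
    c1, h1, s1, lba1 = p1
    c2, h2, s2, lba2 = p2
    r1 = lba1 - (s1 - 1)
    r2 = lba2 - (s2 - 1)
    # Unknowns: x = heads * spt, y = spt.  Solve with Cramer's rule.
    det = c1 * h2 - c2 * h1
    if det == 0:
        return None
    x, x_rem = divmod(r1 * h2 - r2 * h1, det)
    y, y_rem = divmod(c1 * r2 - c2 * r1, det)
    if x_rem or y_rem or x <= 0 or y <= 0 or x % y:
        return None
    return x // y, y


# Two CHS/LBA pairs generated with a 255-head, 63-sector geometry.
print(solve_geometry((0, 1, 1, 63), (12, 254, 63, 208844)))
```

Note that two entries on cylinder 0 give a zero determinant, which is one
reason a table holding a single small partition may not be enough to
recover the geometry this way.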

--
tejun
