[PATCH] block: export SSD/non-rotational queue flag through sysfs

Bartlomiej Zolnierkiewicz

Jan 5, 2009, 7:00:12 PM

From: Bartlomiej Zolnierkiewicz <bzol...@gmail.com>
Subject: [PATCH] block: export SSD/non-rotational queue flag through sysfs

For some devices (e.g. CFA ATA) we can't reliably detect whether
the device is of rotational or non-rotational type, so we need to
leave the final decision about this setting to user space.

Suggested-by: Alan Cox <al...@lxorguk.ukuu.org.uk>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzol...@gmail.com>
---
 block/blk-sysfs.c |   30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

Index: b/block/blk-sysfs.c
===================================================================
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -130,6 +130,27 @@ static ssize_t queue_max_hw_sectors_show
 	return queue_var_show(max_hw_sectors_kb, (page));
 }

+static ssize_t queue_nonrot_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(blk_queue_nonrot(q), page);
+}
+
+static ssize_t queue_nonrot_store(struct request_queue *q, const char *page,
+				  size_t count)
+{
+	unsigned long nm;
+	ssize_t ret = queue_var_store(&nm, page, count);
+
+	spin_lock_irq(q->queue_lock);
+	if (nm)
+		queue_flag_set(QUEUE_FLAG_NONROT, q);
+	else
+		queue_flag_clear(QUEUE_FLAG_NONROT, q);
+	spin_unlock_irq(q->queue_lock);
+
+	return ret;
+}
+
 static ssize_t queue_nomerges_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(blk_queue_nomerges(q), page);
@@ -146,8 +167,8 @@ static ssize_t queue_nomerges_store(stru
 		queue_flag_set(QUEUE_FLAG_NOMERGES, q);
 	else
 		queue_flag_clear(QUEUE_FLAG_NOMERGES, q);
-
 	spin_unlock_irq(q->queue_lock);
+
 	return ret;
 }

@@ -210,6 +231,12 @@ static struct queue_sysfs_entry queue_hw
 	.show = queue_hw_sector_size_show,
 };

+static struct queue_sysfs_entry queue_nonrot_entry = {
+	.attr = {.name = "nonrot", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_nonrot_show,
+	.store = queue_nonrot_store,
+};
+
 static struct queue_sysfs_entry queue_nomerges_entry = {
 	.attr = {.name = "nomerges", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_nomerges_show,
@@ -229,6 +256,7 @@ static struct attribute *default_attrs[]
 	&queue_max_sectors_entry.attr,
 	&queue_iosched_entry.attr,
 	&queue_hw_sector_size_entry.attr,
+	&queue_nonrot_entry.attr,
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
 	NULL,
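
A rough userspace model of the show/store pair in this patch may help
when reading it. This is only a sketch: model_queue, MODEL_FLAG_NONROT
and both functions are made-up stand-ins rather than kernel code,
locking is left out, and it assumes queue_var_store() parses a decimal
number and returns the byte count it was given.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the QUEUE_FLAG_NONROT bit. */
#define MODEL_FLAG_NONROT (1UL << 0)

/* Hypothetical stand-in for struct request_queue. */
struct model_queue {
	unsigned long flags;
};

/* Models the store path: any non-zero number sets the flag, zero clears it. */
size_t model_nonrot_store(struct model_queue *q, const char *page, size_t count)
{
	unsigned long nm = strtoul(page, NULL, 10);

	if (nm)
		q->flags |= MODEL_FLAG_NONROT;
	else
		q->flags &= ~MODEL_FLAG_NONROT;

	/* Assumption: the whole write is always reported as consumed. */
	return count;
}

/* Models the show path: the flag is reported as 0 or 1. */
int model_nonrot_show(const struct model_queue *q)
{
	return (q->flags & MODEL_FLAG_NONROT) != 0;
}
```

In this model, writing "1" (or any other non-zero number) marks the
queue non-rotational and writing "0" marks it rotational again.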

Alan Cox

Jan 5, 2009, 7:00:19 PM

On Mon, 5 Jan 2009 19:52:57 +0100
Bartlomiej Zolnierkiewicz <bzol...@gmail.com> wrote:

> From: Bartlomiej Zolnierkiewicz <bzol...@gmail.com>
> Subject: [PATCH] block: export SSD/non-rotational queue flag through sysfs
>
> For some devices (i.e. CFA ATA) we can't reliably detect whether
> the device is of rotational or non-rotational type so we need to
> leave the final decision about this setting to the user-space.
>
> Suggested-by: Alan Cox <al...@lxorguk.ukuu.org.uk>

Nice - exactly what is needed

Acked-by: Alan Cox <al...@redhat.com>

Jens Axboe

Jan 5, 2009, 7:00:20 PM

On Mon, Jan 05 2009, Bartlomiej Zolnierkiewicz wrote:
> From: Bartlomiej Zolnierkiewicz <bzol...@gmail.com>
> Subject: [PATCH] block: export SSD/non-rotational queue flag through sysfs
>
> For some devices (i.e. CFA ATA) we can't reliably detect whether
> the device is of rotational or non-rotational type so we need to
> leave the final decision about this setting to the user-space.

I agree with that, was actually planning on doing that myself.

> @@ -146,8 +167,8 @@ static ssize_t queue_nomerges_store(stru
> queue_flag_set(QUEUE_FLAG_NOMERGES, q);
> else
> queue_flag_clear(QUEUE_FLAG_NOMERGES, q);
> -
> spin_unlock_irq(q->queue_lock);
> +
> return ret;
> }
>

Hmm?

> @@ -210,6 +231,12 @@ static struct queue_sysfs_entry queue_hw
> .show = queue_hw_sector_size_show,
> };
>
> +static struct queue_sysfs_entry queue_nonrot_entry = {
> + .attr = {.name = "nonrot", .mode = S_IRUGO | S_IWUSR },
> + .show = queue_nonrot_show,
> + .store = queue_nonrot_store,
> +};
> +

Let's please use a better name for export reasons; non-rotational is a
lot better. Nobody will know what nonrot means :-)

--
Jens Axboe

Bartlomiej Zolnierkiewicz

Jan 5, 2009, 7:10:12 PM

On Monday 05 January 2009, Jens Axboe wrote:
> On Mon, Jan 05 2009, Bartlomiej Zolnierkiewicz wrote:
> > From: Bartlomiej Zolnierkiewicz <bzol...@gmail.com>
> > Subject: [PATCH] block: export SSD/non-rotational queue flag through sysfs
> >
> > For some devices (i.e. CFA ATA) we can't reliably detect whether
> > the device is of rotational or non-rotational type so we need to
> > leave the final decision about this setting to the user-space.
>
> I agree with that, was actually planning on doing that myself.
>
> > @@ -146,8 +167,8 @@ static ssize_t queue_nomerges_store(stru
> > queue_flag_set(QUEUE_FLAG_NOMERGES, q);
> > else
> > queue_flag_clear(QUEUE_FLAG_NOMERGES, q);
> > -
> > spin_unlock_irq(q->queue_lock);
> > +
> > return ret;
> > }
> >
>
> Hmm?

This is just a "bonus". :)

> > @@ -210,6 +231,12 @@ static struct queue_sysfs_entry queue_hw
> > .show = queue_hw_sector_size_show,
> > };
> >
> > +static struct queue_sysfs_entry queue_nonrot_entry = {
> > + .attr = {.name = "nonrot", .mode = S_IRUGO | S_IWUSR },
> > + .show = queue_nonrot_show,
> > + .store = queue_nonrot_store,
> > +};
> > +
>
> Lets please use a better name for export reasons, non-rotational is a
> lot better. Nobody will know what nonrot means :-)

Yeah...

From: Bartlomiej Zolnierkiewicz <bzol...@gmail.com>
Subject: [PATCH v2] block: export SSD/non-rotational queue flag through sysfs

For some devices (e.g. CFA ATA) we can't reliably detect whether
the device is of rotational or non-rotational type, so we need to
leave the final decision about this setting to user space.

As a bonus do a minor CodingStyle fixup in queue_nomerges_store().

@@ -146,8 +167,8 @@ static ssize_t queue_nomerges_store(stru
 		queue_flag_set(QUEUE_FLAG_NOMERGES, q);
 	else
 		queue_flag_clear(QUEUE_FLAG_NOMERGES, q);
-
 	spin_unlock_irq(q->queue_lock);
+
 	return ret;
 }

@@ -210,6 +231,12 @@ static struct queue_sysfs_entry queue_hw
 	.show = queue_hw_sector_size_show,
 };

+static struct queue_sysfs_entry queue_nonrot_entry = {
+	.attr = {.name = "non-rotational", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_nonrot_show,
+	.store = queue_nonrot_store,
+};
+
 static struct queue_sysfs_entry queue_nomerges_entry = {
 	.attr = {.name = "nomerges", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_nomerges_show,
@@ -229,6 +256,7 @@ static struct attribute *default_attrs[]
 	&queue_max_sectors_entry.attr,
 	&queue_iosched_entry.attr,
 	&queue_hw_sector_size_entry.attr,
+	&queue_nonrot_entry.attr,
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
 	NULL,

James Bottomley

Jan 5, 2009, 7:10:14 PM

Almost all block devices might find this useful. Ought not this flag to
be in the block layer appearing under /sys/block/<dev>?

James

Jens Axboe

Jan 5, 2009, 7:20:08 PM

On Mon, Jan 05 2009, Bartlomiej Zolnierkiewicz wrote:
> On Monday 05 January 2009, Jens Axboe wrote:
> > On Mon, Jan 05 2009, Bartlomiej Zolnierkiewicz wrote:
> > > From: Bartlomiej Zolnierkiewicz <bzol...@gmail.com>
> > > Subject: [PATCH] block: export SSD/non-rotational queue flag through sysfs
> > >
> > > For some devices (i.e. CFA ATA) we can't reliably detect whether
> > > the device is of rotational or non-rotational type so we need to
> > > leave the final decision about this setting to the user-space.
> >
> > I agree with that, was actually planning on doing that myself.
> >
> > > @@ -146,8 +167,8 @@ static ssize_t queue_nomerges_store(stru
> > > queue_flag_set(QUEUE_FLAG_NOMERGES, q);
> > > else
> > > queue_flag_clear(QUEUE_FLAG_NOMERGES, q);
> > > -
> > > spin_unlock_irq(q->queue_lock);
> > > +
> > > return ret;
> > > }
> > >
> >
> > Hmm?
>
> This is just a "bonus". :)

I typically prefer a line in-between, but apparently blk-sysfs uses the
other style, so it's good to at least be consistent locally :-)

I've applied the updated patch, thanks!

--
Jens Axboe

Jens Axboe

Jan 5, 2009, 7:20:14 PM

The block layer only uses it for scheduling purposes, in which case it
fits well in queue/ since you need the io scheduler.

--
Jens Axboe

Kay Sievers

Jan 5, 2009, 9:50:08 PM

On Mon, Jan 5, 2009 at 19:54, Jens Axboe <jens....@oracle.com> wrote:
> On Mon, Jan 05 2009, Bartlomiej Zolnierkiewicz wrote:

>> +static struct queue_sysfs_entry queue_nonrot_entry = {
>> + .attr = {.name = "nonrot", .mode = S_IRUGO | S_IWUSR },
>> + .show = queue_nonrot_show,
>> + .store = queue_nonrot_store,
>> +};
>> +
>
> Lets please use a better name for export reasons, non-rotational is a
> lot better. Nobody will know what nonrot means :-)

What's that negation good for? Can't we just have "rotational", like
we have "removable" and not "non-removable"? :)

Thanks,
Kay

Sitsofe Wheeler

Jan 5, 2009, 10:20:11 PM

Kay Sievers wrote:
> On Mon, Jan 5, 2009 at 19:54, Jens Axboe <jens....@oracle.com> wrote:
>> On Mon, Jan 05 2009, Bartlomiej Zolnierkiewicz wrote:
>
>>> +static struct queue_sysfs_entry queue_nonrot_entry = {
>>> + .attr = {.name = "nonrot", .mode = S_IRUGO | S_IWUSR },
>>> + .show = queue_nonrot_show,
>>> + .store = queue_nonrot_store,
>>> +};
>>> +
>> Lets please use a better name for export reasons, non-rotational is a
>> lot better. Nobody will know what nonrot means :-)
>
> What's that negation good for? Can't we just have "rotational", like
> we have "removable" and not "non-removable"? :)

How about cheapseek? fastrandom? flash? ssd? However the internal flag
is called QUEUE_FLAG_NONROT so it kind of makes sense just to leave it
as nonrot...

Stefan Richter

Jan 6, 2009, 1:30:08 AM

Sitsofe Wheeler wrote:
> How about cheapseek? fastrandom? flash? ssd? However the internal flag
> is called QUEUE_FLAG_NONROT so it kind of makes sense just to leave it
> as nonrot...

It is not necessary to obfuscate an interface to userspace just because
some internally used cpp macro got an awkward name.
--
Stefan Richter
-=====-==--= ---= --==-
http://arcgraph.de/sr/

Jens Axboe

Jan 6, 2009, 7:40:05 AM

On Mon, Jan 05 2009, Kay Sievers wrote:
> On Mon, Jan 5, 2009 at 19:54, Jens Axboe <jens....@oracle.com> wrote:
> > On Mon, Jan 05 2009, Bartlomiej Zolnierkiewicz wrote:
>
> >> +static struct queue_sysfs_entry queue_nonrot_entry = {
> >> + .attr = {.name = "nonrot", .mode = S_IRUGO | S_IWUSR },
> >> + .show = queue_nonrot_show,
> >> + .store = queue_nonrot_store,
> >> +};
> >> +
> >
> > Lets please use a better name for export reasons, non-rotational is a
> > lot better. Nobody will know what nonrot means :-)
>
> What's that negation good for? Can't we just have "rotational", like
> we have "removable" and not "non-removable"? :)

Non-rotational is the term typically used, since rotational is the norm
(still). So I think the negation actually makes sense in this case :-)

--
Jens Axboe

Hugh Dickins

Jan 6, 2009, 7:50:04 PM

On Mon, 5 Jan 2009, Sitsofe Wheeler wrote:
> Kay Sievers wrote:
> > On Mon, Jan 5, 2009 at 19:54, Jens Axboe <jens....@oracle.com> wrote:
> > > On Mon, Jan 05 2009, Bartlomiej Zolnierkiewicz wrote:
> >
> > > > +static struct queue_sysfs_entry queue_nonrot_entry = {
> > > > + .attr = {.name = "nonrot", .mode = S_IRUGO | S_IWUSR },
> > > > + .show = queue_nonrot_show,
> > > > + .store = queue_nonrot_store,
> > > > +};
> > > > +
> > > Lets please use a better name for export reasons, non-rotational is a
> > > lot better. Nobody will know what nonrot means :-)
> >
> > What's that negation good for? Can't we just have "rotational", like
> > we have "removable" and not "non-removable"? :)
>
> How about cheapseek? fastrandom? flash? ssd? However the internal flag is
> called QUEUE_FLAG_NONROT so it kind of makes sense just to leave it as
> nonrot...

Many thanks to Bart (and Alan) for this patch:
just what I'd been hoping for when I wrote (25 Nov)
"But how to get my SD card, accessed by USB card reader, reported as NONROT?"

However, may I join Kay in protesting the negative boolean flag?
I've never quite recovered from CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER;
and while this is obviously nowhere near the same league, I'd much
rather stay well away.

I don't like the "nonrot" or "rotational" at all myself: though you
may be able to advance convincing arguments why it's not accidental,
isn't the rotationality pretty much incidental to whether the seeks
are cheap? I imagine a long bar loaded with data, which zips back
and forth through the reader: nothing rotational, but expensive seeks.
(I had been going to propose magnetic tape, but the rotation of the
spools muddies that argument.)

When doing the swap patch, though I toed the line with SWP_NONROT
for quite a while (I do dislike using different names for the same
notion at different levels), I couldn't stomach it in the end, and
went for SWP_SOLIDSTATE.

But I particularly like Sitsofe's "cheapseek". Is that an accurate
representation of how the I/O schedulers treat it? If so, please can we
name the user-visible sysfs file accordingly?

The kernel-internal name is much less important, though I'd be pretty
happy to have CHEAPSEEK instead of NONROT throughout there too.

Oh, another problem with NONROT: flash rots a lot sooner than disk,
doesn't it?

Hugh

Michael Tokarev

Jan 7, 2009, 10:50:05 AM

Jens Axboe wrote:
> On Mon, Jan 05 2009, Kay Sievers wrote:
>> On Mon, Jan 5, 2009 at 19:54, Jens Axboe <jens....@oracle.com> wrote:
>>> On Mon, Jan 05 2009, Bartlomiej Zolnierkiewicz wrote:
>>>> +static struct queue_sysfs_entry queue_nonrot_entry = {
>>>> + .attr = {.name = "nonrot", .mode = S_IRUGO | S_IWUSR },
>>>> + .show = queue_nonrot_show,
>>>> + .store = queue_nonrot_store,
>>>> +};
>>>> +
>>> Lets please use a better name for export reasons, non-rotational is a
>>> lot better. Nobody will know what nonrot means :-)
>> What's that negation good for? Can't we just have "rotational", like
>> we have "removable" and not "non-removable"? :)
>
> Non-rotational is the term typically used, since rotational is the norm
> (still). So I think the negation actually makes sense in this case :-)

You used the word "still" yourself. I mean, in 5 years SSD will be more
common than rotational media, and "the norm" will be !rotational..
So let's name them correctly and uniformly from the beginning.. ;)

/mjt

Jens Axboe

Jan 7, 2009, 11:30:10 AM

On Wed, Jan 07 2009, Michael Tokarev wrote:
> Jens Axboe wrote:
> > On Mon, Jan 05 2009, Kay Sievers wrote:
> >> On Mon, Jan 5, 2009 at 19:54, Jens Axboe <jens....@oracle.com> wrote:
> >>> On Mon, Jan 05 2009, Bartlomiej Zolnierkiewicz wrote:
> >>>> +static struct queue_sysfs_entry queue_nonrot_entry = {
> >>>> + .attr = {.name = "nonrot", .mode = S_IRUGO | S_IWUSR },
> >>>> + .show = queue_nonrot_show,
> >>>> + .store = queue_nonrot_store,
> >>>> +};
> >>>> +
> >>> Lets please use a better name for export reasons, non-rotational is a
> >>> lot better. Nobody will know what nonrot means :-)
> >> What's that negation good for? Can't we just have "rotational", like
> >> we have "removable" and not "non-removable"? :)
> >
> > Non-rotational is the term typically used, since rotational is the norm
> > (still). So I think the negation actually makes sense in this case :-)
>
> You used the word "still" yourself. I mean, in 5 years SSD will be more
> common than rotational media, and "the norm" will be !rotational..
> So let's name them correctly and uniformly from the beginning.. ;)

Not sure I agree with your SSD adoption rate, but that's beside the
point :-)

But I'm inclined to agree with you and Kay after all, let's just call it
'rotational'. I think alternatives like 'cheapseek' are truly horrible,
nobody will know what that means - how cheap is cheap?

I'll fix up the patch.

--
Jens Axboe
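
For user space, the result of this naming decision could be consumed
roughly as sketched below. The sketch assumes the attribute ends up as
queue/rotational under /sys/block/<dev>/, with 1 meaning rotational and
0 meaning non-rotational; read_rotational() is a made-up helper name,
not an existing API.

```c
#include <stdio.h>

/*
 * Returns 1 if the queue is flagged rotational, 0 if it is not, and -1
 * if the attribute cannot be opened or parsed.
 * Assumption: the file holds a single decimal number.
 */
int read_rotational(const char *disk)
{
	char path[256];
	int n, value;
	FILE *f;

	n = snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", disk);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);

	if (value < 0)
		return -1;
	return value != 0;
}
```

A caller would pass a disk name such as "sda" and treat -1 as "unknown"
rather than guessing either way.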

James Bottomley

Jan 7, 2009, 3:40:13 PM

On Wed, 2009-01-07 at 13:39 +0300, Michael Tokarev wrote:
> Jens Axboe wrote:
> > On Mon, Jan 05 2009, Kay Sievers wrote:
> >> On Mon, Jan 5, 2009 at 19:54, Jens Axboe <jens....@oracle.com> wrote:
> >>> On Mon, Jan 05 2009, Bartlomiej Zolnierkiewicz wrote:
> >>>> +static struct queue_sysfs_entry queue_nonrot_entry = {
> >>>> + .attr = {.name = "nonrot", .mode = S_IRUGO | S_IWUSR },
> >>>> + .show = queue_nonrot_show,
> >>>> + .store = queue_nonrot_store,
> >>>> +};
> >>>> +
> >>> Lets please use a better name for export reasons, non-rotational is a
> >>> lot better. Nobody will know what nonrot means :-)
> >> What's that negation good for? Can't we just have "rotational", like
> >> we have "removable" and not "non-removable"? :)
> >
> > Non-rotational is the term typically used, since rotational is the norm
> > (still). So I think the negation actually makes sense in this case :-)
>
> You used the word "still" yourself. I mean, in 5 years SSD will be more
> common than rotational media, and "the norm" will be !rotational..
> So let's name them correctly and uniformly from the beginning.. ;)

I'm afraid that's pretty much marketing Kool-Aid. Rotational storage
will dominate for the foreseeable future: just do a simple back of the
envelope calculation:

* Total shipped spinning storage in 2008: 128EB (Gartner figures)
* If all the chip fabs in all the world were turned solely to
  flash production, they'd be able to ship about 16EB per year (or
  about 13% of the total 2008 consumption [estimate from Steve
  Hetzler, IBM]), assuming any given fab can produce about 0.5EB
  per year.
* Then factor in that storage requirements are growing
  exponentially (the 2009 estimated requirement is 400EB).
* Given that it costs ~$3-7bn for one fab plant and that we're in
  an economic depression, no-one is making the necessary trillion
  dollar capital investment to even make flash be a significant
  fraction of the current storage market ... it's not even clear
  that it will be able to break the 1% barrier, let alone the 10%
  one.
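
The arithmetic behind these bullets can be checked with a few lines of
C. This is only a sketch using the figures quoted above; the function
names are made up, and the capital cost ignores any fabs that already
exist.

```c
/* Figures quoted above, in exabytes (EB). */
#define HDD_SHIPPED_2008_EB     128.0
#define FLASH_MAX_EB_PER_YEAR    16.0
#define FAB_OUTPUT_EB_PER_YEAR    0.5

/* Flash output as a percentage of 2008 disk shipments: 16/128 = 12.5%. */
double flash_share_percent(void)
{
	return 100.0 * FLASH_MAX_EB_PER_YEAR / HDD_SHIPPED_2008_EB;
}

/* Number of fabs needed to produce a given yearly volume. */
double fab_count(double eb_per_year)
{
	return eb_per_year / FAB_OUTPUT_EB_PER_YEAR;
}

/* Capital cost in $bn for a number of fabs at a given cost per fab. */
double capex_bn(double fab_total, double cost_per_fab_bn)
{
	return fab_total * cost_per_fab_bn;
}
```

With these numbers, covering the 400EB estimate for 2009 would take
fab_count(400.0) = 800 fabs, or between capex_bn(800.0, 3.0) = $2400bn
and capex_bn(800.0, 7.0) = $5600bn, which is where the "trillion
dollar" order of magnitude comes from.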

That's not to say flash isn't important, it is; it's just to remind
everyone that storage subsystems will be focused on rotating media. Any
flash features we add have to make sure they don't impact our rotating
performance.

Sorry, now we can go back to regularly scheduled programming. This
isn't the first "everything will be flash and we should be planning for
it" type statement, but I thought a little reality injection would be
appropriate at this juncture.

James

Tejun Heo

Jan 15, 2009, 5:40:07 AM

Hello,

James Bottomley wrote:
> I'm afraid that's pretty much marketing coolaid. Rotational storage
> will dominate for the forseeable future: just do a simple back of the
> envelope calculation:

Or just compare prices per byte of memory, flash and rotational disk.
They haven't changed much during the last several years.
Secondary storage which is only slightly cheaper than the primary
storage doesn't have much chance of flying high and far.

Thanks.

--
tejun

Greg Freemyer

Jan 15, 2009, 3:10:08 PM

On Thu, Jan 15, 2009 at 12:37 AM, Tejun Heo <hte...@gmail.com> wrote:
> Hello,
>
> James Bottomley wrote:
>> I'm afraid that's pretty much marketing coolaid. Rotational storage
>> will dominate for the forseeable future: just do a simple back of the
>> envelope calculation:
>
> Or just compare prices per byte of memory, flash and rotation disk.
> They haven't had changed too much during last several years.
> Secondary storage which is only slightly cheaper than the primary
> storage doesn't have much chance of flying high and far.
>
> Thanks.
>
> --
> tejun

Have you seen the new pricing Samsung has announced for their 3rd
generation SSD? It is about 1/3 of Intel's SSD price if I recall
correctly, and the performance is approaching Intel's from what I've
seen.

I've been talking to the OpenHSM (Hierarchical Storage Manager) team
about their project. They are working on getting the logic in place
now to move data blocks from one class of storage to another while
leaving the filesystem itself unaffected from the user's perspective.

http://code.google.com/p/fscops/

They have a very long way to go with their code/project, but it is
conceptually similar to the ext4_defrag patch that already exists.
The big difference is the data block allocation algorithm will have to
be totally different.

If and when they get their code done, I would love to have 500 GB of
SSD teamed with several TB of rotational HDD and have the HSM move my
files between fast SSD and slow rotational. I typically know which
datasets I will be working with heavily, so even a simple user space
tool that would let me adjust which tier of storage my files were
sitting on would suffice.

Greg
--
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

Tejun Heo

Jan 15, 2009, 3:50:06 PM

Greg Freemyer wrote:
>> Or just compare prices per byte of memory, flash and rotation disk.
>> They haven't had changed too much during last several years.

I meant the ratio of prices here.

>> Secondary storage which is only slightly cheaper than the primary
>> storage doesn't have much chance of flying high and far.

>> tejun
>
> Have you seen the new pricing Samsung has announced for their 3rd
> generation SSD. It is about 1/3 of the Intel' SSD price if I recall
> correctly and the performance is approaching Intel's from what I've
> seen.

I compare the prices from time to time (about once a year I think) and
the difference has usually been much higher than 20 times, if my memory
serves me right. Intel SSDs are on the pretty expensive side; even 1/3
of that price means a > 20 times price difference per byte. If you
compare that price with main memory, it's only ~3.5 times cheaper.

> I've been talking to the OpenHSM (Hierachical Storage Manager) team
> about their project. They are working on getting the logic in place
> now to move data blocks from one class of storage to another while
> leaving the filesystem itself un-affected from the users perspective.
>
> http://code.google.com/p/fscops/
>
> They have a very long way to go with their code/project, but it is
> conceptually similar to the ext4_defrag patch that already exists.
> The big difference is the data block allocation algorithm will have to
> be totally different.
>
> If and when that get their code done, I would love to have 500 GB of
> SSD teamed with several TB of rotational HDD and have the HSM move my
> files between fast SSD and slow rotational. I typically know which
> datasets I will be working with heavily, so even a simple user space
> tool that would let me adjust which tier of storage my files were
> sitting on would suffice.

I'd love that too. For areas where data size doesn't grow
exponentially, it is and will continue to be great, and it will keep
getting better. I'm just not sure whether it will rise as the mainstream
secondary storage in the foreseeable future given the price discrepancy,
but I'll be happy to be taken by surprise. :-)

Thanks.

--
tejun

James Bottomley

Jan 15, 2009, 4:10:07 PM

On Thu, 2009-01-15 at 10:07 -0500, Greg Freemyer wrote:
> On Thu, Jan 15, 2009 at 12:37 AM, Tejun Heo <hte...@gmail.com> wrote:
> > Hello,
> >
> > James Bottomley wrote:
> >> I'm afraid that's pretty much marketing coolaid. Rotational storage
> >> will dominate for the forseeable future: just do a simple back of the
> >> envelope calculation:
> >
> > Or just compare prices per byte of memory, flash and rotation disk.
> > They haven't had changed too much during last several years.
> > Secondary storage which is only slightly cheaper than the primary
> > storage doesn't have much chance of flying high and far.
> >
> > Thanks.
> >
> > --
> > tejun
>
> Have you seen the new pricing Samsung has announced for their 3rd
> generation SSD. It is about 1/3 of the Intel' SSD price if I recall
> correctly and the performance is approaching Intel's from what I've
> seen.

I think the point Tejun and I are trying to make is that current SSD
pricing is an artefact of the fact that there's currently a world
surplus of flash components and that today SSD production is tiny. If
SSD production rises, the demand pressure will force flash prices back
to normal (or even above if manufacturers can charge a premium) and
you'll see SSDs priced much higher than rotational storage.

That's not to say that no-one should be buying today's cheap flash
storage ... just a warning that it can't last if SSDs become hugely
popular.

> I've been talking to the OpenHSM (Hierachical Storage Manager) team
> about their project. They are working on getting the logic in place
> now to move data blocks from one class of storage to another while
> leaving the filesystem itself un-affected from the users perspective.
>
> http://code.google.com/p/fscops/
>
> They have a very long way to go with their code/project, but it is
> conceptually similar to the ext4_defrag patch that already exists.
> The big difference is the data block allocation algorithm will have to
> be totally different.
>
> If and when that get their code done, I would love to have 500 GB of
> SSD teamed with several TB of rotational HDD and have the HSM move my
> files between fast SSD and slow rotational. I typically know which
> datasets I will be working with heavily, so even a simple user space
> tool that would let me adjust which tier of storage my files were
> sitting on would suffice.

Right, this is the flash cache idea that's been floating around for a
while. It's even been suggested as a way of avoiding the IDE barrier
flush penalties. I think Seagate went as far as making hybrid drives
that were a large flash cache backed by rotational storage.

James

Grant Grundler

Jan 15, 2009, 7:00:15 PM

On Thu, Jan 15, 2009 at 8:06 AM, James Bottomley
<James.B...@hansenpartnership.com> wrote:
...

> Right, this is the flash cache idea that's been floating around for a
> while. It's even been suggested as a way of avoiding the IDE barrier
> flush penalties. I think Seagate went as far as making hybrid drives
> that were a large flash cache backed by rotational storage.

"inline" RAM caches were implemented 10+ years ago for SCSI devices
to avoid seek *and* rotational delay penalties. These were transparent
SCSI devices that would cache and prefetch content from disks.
For the given dataset size, the seek cost was effectively zero. This is
mostly true today for high-end disk arrays which have large RAM
front ends (TB) and very smart data placement strategies (reduced seek).

Back to the bikeshed painting: I prefer two flags: "AVGREADCOST" and
"AVGWRITECOST". This isn't just about seek. Rotational delay
@7200 RPM is 8.3ms. Making both non-zero implies rotational media.
Different devices have different read and write characteristics.
Flash typically has a much higher write cost than read cost.
Disks (because of WCE) can be the opposite.
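
As a quick check of the 7200 RPM figure: 8.3ms is the time for one full
revolution, so it is the worst-case rotational delay, and the average
wait for a random sector is half of that. The helper names below are
made up for illustration.

```c
/* Time for one full platter revolution, in milliseconds. */
double revolution_ms(double rpm)
{
	return 60.0 * 1000.0 / rpm;
}

/* Average rotational latency: on average half a revolution goes by. */
double avg_rotational_latency_ms(double rpm)
{
	return revolution_ms(rpm) / 2.0;
}
```

For 7200 RPM this gives about 8.33ms per revolution and about 4.17ms
average rotational latency, before any seek time is added.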

Code can test for zero/nonzero or (preferably) more fine grained.
e.g. "avgreadcost > 1ms" or "avgwritecost". I'm hoping this test
can be abstracted into a macro.
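
One way the test might be wrapped in a macro is sketched here. None of
this exists in the kernel: struct queue_costs, its fields and both
macros are hypothetical names, and microseconds are an arbitrary choice
of unit.

```c
/* Hypothetical per-queue cost estimates in microseconds; 0 means "free". */
struct queue_costs {
	unsigned int avg_read_cost_us;
	unsigned int avg_write_cost_us;
};

/* Fine-grained test, e.g. "avgreadcost > 1ms" is a 1000us threshold. */
#define QUEUE_READ_COSTLY(c, threshold_us) \
	((c)->avg_read_cost_us > (threshold_us))

/* Coarse test: both costs non-zero is taken to imply rotational media. */
#define QUEUE_LOOKS_ROTATIONAL(c) \
	((c)->avg_read_cost_us != 0 && (c)->avg_write_cost_us != 0)
```

Callers would then depend only on the macros, so the representation
behind them could later change from a flag to measured values.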

I'm hoping that, long term, the values could be "self tuning", but I
don't know how that might work - e.g. 1 minute avg? 10 minute avg? Cost
of collecting/maintaining the stats? Feels like a CONFIG option.
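
One cheap candidate for such self tuning is an exponentially weighted
moving average, which needs one stored value per queue instead of a
window of samples. This is a sketch of the general technique, not
something proposed in this thread; struct cost_avg and
cost_avg_update() are hypothetical names.

```c
/* Hypothetical running estimate of a per-request cost, in microseconds. */
struct cost_avg {
	long avg_us;
	int primed;	/* 0 until the first sample has been seen */
};

/*
 * Fold one sample into the average: avg += (sample - avg) / 2^shift.
 * A larger shift means a longer memory and a slower reaction.
 */
void cost_avg_update(struct cost_avg *a, long sample_us, unsigned int shift)
{
	if (!a->primed) {
		a->avg_us = sample_us;
		a->primed = 1;
		return;
	}
	a->avg_us += (sample_us - a->avg_us) / (1L << shift);
}
```

The shift plays the role of the "1 minute vs 10 minute" question: it
sets how many recent requests effectively contribute to the estimate.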

hth,
grant

James Bottomley

Jan 15, 2009, 7:10:19 PM

On Thu, 2009-01-15 at 10:55 -0800, Grant Grundler wrote:
> On Thu, Jan 15, 2009 at 8:06 AM, James Bottomley
> <James.B...@hansenpartnership.com> wrote:
> ...
> > Right, this is the flash cache idea that's been floating around for a
> > while. It's even been suggested as a way of avoiding the IDE barrier
> > flush penalties. I think Seagate went as far as making hybrid drives
> > that were a large flash cache backed by rotational storage.
>
> "inline" RAM caches were implemented 10+ years ago for SCSI devices
> to avoid seek *and* rotational delay penalties. These were transperent
> SCSI devices that would cache and prefetch content from disks.
> For the given dataset size, the seek cost was effectively zero. This is
> mostly true today for high-end disk arrays which have large RAM
> front-end's (TB) and very smart data placement strategies (reduced seek).

RAM caches are different from flash caches: they're volatile.
Essentially they're the cause of our barrier flush issues. The point
about flash caches is they're non-volatile. You can pull power from the
disk as soon as the data is in flash cache; the disk can commit it again
as soon as it powers up.

> Back to the bikeshed painting: I prefer two flags: "AVGREADCOST" and
> "AVGWRITECOST as the flag. This isn't just about seek. Rotational delay
> @7200 RPM is 8.3ms. Making both non-zero implies rotational media.
> Different devices have different read and write characteristics.
> Flash typically has a much higher write cost than read cost.
> Disks (because of WCE) can be the opposite.
>
> Code can test for zero/nonzero or (preferably) more fine grained.
> e.g. "avgreadcost > 1ms" or "avgwritecost". I'm hoping this test
> can be abstracted into a macro.

Um these really have to be things we can get out of the device at boot
time without effort (as in part of the data the device can give in a
single command). I'll be shot for increasing boot time so we can work
out these parameters ...

> I'm hoping longterm, the values could be "self tuning" but don't know
> how that might work - e.g. 1 minute avg? 10 minute avg? Cost
> of collecting/maintaining the stats? Feels like a CONFIG option.

CONFIG_SLOW_YOUR_BOOT?

James

Grant Grundler

Jan 15, 2009, 11:00:24 PM

On Thu, Jan 15, 2009 at 11:00 AM, James Bottomley
<James.B...@hansenpartnership.com> wrote:
...

>> Code can test for zero/nonzero or (preferably) more fine grained.
>> e.g. "avgreadcost > 1ms" or "avgwritecost". I'm hoping this test
>> can be abstracted into a macro.
>
> Um these really have to be things we can get out of the device at boot
> time without effort (as in part of the data the device can give in a
> single command). I'll be shot for increasing boot time so we can work
> out these parameters ...

No. The whole point is we should not care what it is at boot time.
It should be based on recent history of what is going on.
At boot time we read the partition table and the superblocks to mount
file systems. That's fine to start with. So I don't see any need to add
some synthetic test to establish initial values.

The rest of the code should work regardless of what the values start out to be.
This is true for the previous proposed patch too when user space has
to decide what the right policy is.

>> I'm hoping longterm, the values could be "self tuning" but don't know
>> how that might work - e.g. 1 minute avg? 10 minute avg? Cost
>> of collecting/maintaining the stats? Feels like a CONFIG option.
>
> CONFIG_SLOW_YOUR_BOOT?

maybe CONFIG_AUTOTUNE_RWCOST.

thanks,
grant

James Bottomley

Jan 15, 2009, 11:20:11 PM

On Thu, 2009-01-15 at 14:45 -0800, Grant Grundler wrote:
> On Thu, Jan 15, 2009 at 11:00 AM, James Bottomley
> <James.B...@hansenpartnership.com> wrote:
> ...
> >> Code can test for zero/nonzero or (preferably) more fine grained.
> >> e.g. "avgreadcost > 1ms" or "avgwritecost". I'm hoping this test
> >> can be abstracted into a macro.
> >
> > Um these really have to be things we can get out of the device at boot
> > time without effort (as in part of the data the device can give in a
> > single command). I'll be shot for increasing boot time so we can work
> > out these parameters ...
>
> No. The whole point is we should not care what it is at boot time.
> It should be based on recent history of what is going on.
> At boot time we read the partition table and we superblocks to mount
> file systems.
> That's fine to start with. So I don't see any need to add some synthetic test
> to establish initial values.
>
> The rest of the code should work regardless of what the values start out to be.
> This is true for the previous proposed patch too when user space has
> to decide what the right policy is.

OK, so they could be calculated on the fly in the elevators, I suppose.
But what would the value be? Right now we use the nonrotational flag to
basically not bother with plugging (no point if no seek penalty) on
certain events where we'd previously have waited for other I/O to join.
But that's really a seek penalty parameter rather than the idea of read
or write costing (although the elevators usually track these dynamically
anyway ... as part of the latency calculations but not explicitly).

James

Michael Tokarev

Jan 16, 2009, 12:00:09 AM

Hugh Dickins wrote:

> On Thu, 15 Jan 2009, James Bottomley wrote:
>> OK, so they could be calculated on the fly in the elevators, I suppose.
>> But what would the value be? Right now we use the nonrotational flag to
>> basically not bother with plugging (no point if no seek penalty) on
>> certain events where we'd previously have waited for other I/O to join.
>> But that's really a seek penalty parameter rather than the idea of read
>> or write costing (although the elevators usually track these dynamically
>> anyway ... as part of the latency calculations but not explicitly).
>

> ... not bother with plugging (no point if no seek penalty) ...
>
> I thought there was considerable advantage to plugging writes
> (in case they turn out to be adjacent) on current and older
> generations of non-rotational storage?

I think it's about collecting the whole eraseblock if possible - speaking
of NAND flashes for example.

But I also think that the percentage of whole eraseblocks during writes
will be very low regardless of any plugging, UNLESS the filesystem layout
is optimized especially for that. So such "plugging" is somewhat useless
here - again, unless an application performs a lot of single-byte writes,
as "mscompress" version 0.3 does, for example... (But we honor O_SYNC, so this
case is abusable anyway.)

/mjt
>
> Hugh
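
To make the eraseblock argument concrete, here is a small sketch that counts how many erase blocks a single write covers completely. It is an illustration only: `full_eraseblocks()` is a hypothetical helper, and the erase block size is an assumed input, since devices rarely report it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count how many erase blocks a write of `len` bytes at byte offset `off`
 * covers completely. `eb_size` is the erase block size in bytes and is an
 * assumed input here.
 */
static uint64_t full_eraseblocks(uint64_t off, uint64_t len, uint64_t eb_size)
{
	uint64_t first, last;

	assert(eb_size > 0);
	first = (off + eb_size - 1) / eb_size;	/* first block starting at or after off */
	last = (off + len) / eb_size;		/* first block not fully inside the write */
	return last > first ? last - first : 0;
}
```

With an assumed 128 KiB erase block, a 128 KiB write at offset 0 covers one whole block, while the same write shifted by 4 KiB covers none. Unless the filesystem aligns its writes, merged requests will usually fall into the second case.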

Hugh Dickins

Jan 16, 2009, 12:00:13 AM

On Thu, 15 Jan 2009, James Bottomley wrote:
>

> OK, so they could be calculated on the fly in the elevators, I suppose.
> But what would the value be? Right now we use the nonrotational flag to
> basically not bother with plugging (no point if no seek penalty) on
> certain events where we'd previously have waited for other I/O to join.
> But that's really a seek penalty parameter rather than the idea of read
> or write costing (although the elevators usually track these dynamically
> anyway ... as part of the latency calculations but not explicitly).

... not bother with plugging (no point if no seek penalty) ...

I thought there was considerable advantage to plugging writes
(in case they turn out to be adjacent) on current and older
generations of non-rotational storage?

Hugh

James Bottomley

Jan 16, 2009, 12:40:06 AM

On Thu, 2009-01-15 at 23:50 +0000, Hugh Dickins wrote:
> On Thu, 15 Jan 2009, James Bottomley wrote:
> >
> > OK, so they could be calculated on the fly in the elevators, I suppose.
> > But what would the value be? Right now we use the nonrotational flag to
> > basically not bother with plugging (no point if no seek penalty) on
> > certain events where we'd previously have waited for other I/O to join.
> > But that's really a seek penalty parameter rather than the idea of read
> > or write costing (although the elevators usually track these dynamically
> > anyway ... as part of the latency calculations but not explicitly).
>
> ... not bother with plugging (no point if no seek penalty) ...
>
> I thought there was considerable advantage to plugging writes
> (in case they turn out to be adjacent) on current and older
> generations of non-rotational storage?

Heh, you get as many answers to that one as there are SSD manufacturers.
However, the consensus seems to be that all MLC and SLC flash has a RAID
like architecture internally, thus it can actually be *faster* if you
send multiple commands (each area of the RAID processes independently).
Of course, you have to be *able* to send multiple commands, so the
device must implement TCQ/NCQ, but if it does, it's actually beneficial
*not* to wait even if the requests are adjacent.

However, the reason the nonrotational flag is set from user space is
precisely so if we do find an SSD that has this property, we can just
not set the nonrotational queue flag.

James
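
For reference, setting the flag from user space amounts to writing 0 or 1 to the queue attribute. The sketch below assumes the attribute is named `nonrot`, as in the patch at the top of this thread; mainline kernels expose a `rotational` attribute with the opposite sense, so the path must match the running kernel. `set_queue_flag()` is a hypothetical helper and `sda` is only an example device.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Write 0 or 1 to a queue attribute file. Returns 0 on success or a
 * negative errno value. `path` is supplied by the caller, for example
 * "/sys/block/sda/queue/nonrot" (name assumed from the patch in this
 * thread; "sda" is only an example device).
 */
static int set_queue_flag(const char *path, int enable)
{
	FILE *f;
	int err = 0;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fprintf(f, "%d\n", enable ? 1 : 0) < 0)
		err = -EIO;
	/* sysfs store errors often surface only when the buffer is flushed */
	if (fclose(f) != 0 && !err)
		err = -errno;
	return err;
}
```

A udev rule or init script could call this with 0 for a device that turns out to benefit from plugging, which is the escape hatch described above.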

Dongjun Shin

Jan 16, 2009, 4:00:12 AM

Not all non-rotational SSDs are created equal (as Intel said).

Some SSDs perform better as the I/O queue length increases, while others do not.
For SSDs with scalable queueing performance, it might be better to allow
multiple discrete I/Os.

I'm not sure if "non-rotational" is well suited for tuning the
behavior of elevator merging.

--
Dongjun

Jens Axboe

Jan 16, 2009, 6:50:06 AM

On Thu, Jan 15 2009, Grant Grundler wrote:
> On Thu, Jan 15, 2009 at 11:00 AM, James Bottomley
> <James.B...@hansenpartnership.com> wrote:
> ...
> >> Code can test for zero/nonzero or (preferably) more fine grained.
> >> e.g. "avgreadcost > 1ms" or "avgwritecost". I'm hoping this test
> >> can be abstracted into a macro.
> >
> > Um these really have to be things we can get out of the device at boot
> > time without effort (as in part of the data the device can give in a
> > single command). I'll be shot for increasing boot time so we can work
> > out these parameters ...
>
> No. The whole point is we should not care what it is at boot time. It
> should be based on recent history of what is going on. At boot time
> we read the partition table and we superblocks to mount file systems.
> That's fine to start with. So I don't see any need to add some
> synthetic test to establish initial values.
>
> The rest of the code should work regardless of what the values start
> out to be. This is true for the previous proposed patch too when user
> space has to decide what the right policy is.

I absolutely hate the idea of rw cost numbers. Why? Because it's a
property that's impossible to present as a single number. It depends on
so many different things, like cache settings and access pattern. If you
just want to know the avg seek time of your device, look at the reported
RPM value. Userspace can do that, because the kernel doesn't really care
a whole lot about it to be honest.

--
Jens Axboe
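
As a worked example of what the RPM value gives user space: it yields the average rotational latency directly, since on average the platter must turn half a revolution before the target sector arrives. This covers rotation only, not head movement, so it is a lower bound on access time rather than a full seek figure. `avg_rot_latency_us()` is a hypothetical helper name.

```c
#include <assert.h>

/*
 * Average rotational latency in microseconds.
 * One revolution takes 60 s / rpm, so half a revolution is
 * 30,000,000 us / rpm. Integer division truncates the result.
 */
static unsigned long avg_rot_latency_us(unsigned long rpm)
{
	assert(rpm > 0);
	return 30000000UL / rpm;
}
```

For a 7200 RPM drive this gives 4166 us, and for a 15000 RPM drive 2000 us.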

Jens Axboe

Jan 16, 2009, 6:50:07 AM

On Thu, Jan 15 2009, Hugh Dickins wrote:
> On Thu, 15 Jan 2009, James Bottomley wrote:
> >
> > OK, so they could be calculated on the fly in the elevators, I suppose.
> > But what would the value be? Right now we use the nonrotational flag to
> > basically not bother with plugging (no point if no seek penalty) on
> > certain events where we'd previously have waited for other I/O to join.
> > But that's really a seek penalty parameter rather than the idea of read
> > or write costing (although the elevators usually track these dynamically
> > anyway ... as part of the latency calculations but not explicitly).
>
> ... not bother with plugging (no point if no seek penalty) ...
>
> I thought there was considerable advantage to plugging writes
> (in case they turn out to be adjacent) on current and older
> generations of non-rotational storage?

Don't confuse plugging and merging, although one does help the other.

We can essentially divide the current SSD market into two categories -
queuing and non-queuing. Which also happens to be just about the same as
saying Intel and non-Intel, at least that has been the case since Sep
'08 and until present time. On the queuing devices, plugging does more
harm than good. The IO access time is so fast, that delaying for merging
hurts your performance.

For non-queuing devices, I think our current check is a bit too drastic.
We probably want to change that to

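static inline int dont_plug(struct request_queue *q)
{
	/* skip plugging only on non-rotational devices that also queue */
	return blk_queue_nonrot(q) && blk_queue_tagged(q);
}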

This is identical to what CFQ tests when deciding whether to idle. CFQ idles
to avoid read/write overlaps, which also completely kill performance on the
current SSD drives (except for Intel, which again shines).

--
Jens Axboe
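
The check proposed above can be tried outside the kernel with a small mock. Everything here is a stand-in: `struct mock_queue`, the `MOCK_FLAG_*` bits and the `mock_queue_*()` helpers are hypothetical and only imitate `blk_queue_nonrot()` and `blk_queue_tagged()`; the bit positions are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel's queue flags; the bit positions are made up. */
#define MOCK_FLAG_NONROT	(1UL << 0)
#define MOCK_FLAG_TAGGED	(1UL << 1)

/* Minimal mock of struct request_queue, holding only the flag word. */
struct mock_queue {
	unsigned long queue_flags;
};

static int mock_queue_nonrot(const struct mock_queue *q)
{
	return (q->queue_flags & MOCK_FLAG_NONROT) != 0;
}

static int mock_queue_tagged(const struct mock_queue *q)
{
	return (q->queue_flags & MOCK_FLAG_TAGGED) != 0;
}

/* Skip plugging only for devices that are both non-rotational and queuing. */
static int dont_plug(const struct mock_queue *q)
{
	assert(q != NULL);
	return mock_queue_nonrot(q) && mock_queue_tagged(q);
}
```

Of the four flag combinations, only non-rotational plus tagged skips the plug. A non-queuing SSD therefore keeps plugging and still gets the chance to merge requests.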

Jens Axboe

Jan 16, 2009, 7:00:21 AM

It's not tuning merging, that's a separate tuning knob if someone
wishes to turn that off.

--
Jens Axboe

Pierre Ossman

Feb 3, 2009, 12:40:11 PM

On Fri, 16 Jan 2009 07:48:18 +0100
Jens Axboe <jens....@oracle.com> wrote:

>
> For non-queuing devices, I think our current check is a bit too drastic.
> We probably want to change that to
>
> int dont_plug(q)
> {
> return blk_queue_nonrot(q) && blk_queue_tagged(q);
> }
>

Please do. The MMC block driver sets nonrot as of late, but it is
extremely dependent on merging. The command overhead is just silly, so
to get anywhere near the speed numbers indicated in marketing you need
large requests.

It seems that too many confuse "non-rotational" with "low latency";
the two aren't always the same thing. :)

Rgds
--
-- Pierre Ossman

Linux kernel, MMC maintainer http://www.kernel.org
rdesktop, core developer http://www.rdesktop.org

WARNING: This correspondence is being monitored by the
Swedish government. Make sure your server uses encryption
for SMTP traffic and consider using PGP for end-to-end
encryption.
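
The command-overhead point can be put into numbers with a simple model: each request pays a fixed per-command cost plus the transfer time. This is a sketch with made-up figures, not measurements of any MMC card; `effective_kib_s()` and its parameters are hypothetical.

```c
#include <assert.h>

/*
 * Effective throughput in KiB/s for requests of `req_kib` KiB when each
 * command costs a fixed `overhead_us` microseconds on top of the transfer
 * itself, and the bus moves data at `bus_kib_s` KiB/s.
 */
static unsigned long effective_kib_s(unsigned long req_kib,
				     unsigned long overhead_us,
				     unsigned long bus_kib_s)
{
	unsigned long long xfer_us, total_us;

	assert(req_kib > 0 && bus_kib_s > 0);
	xfer_us = (unsigned long long)req_kib * 1000000ULL / bus_kib_s;
	total_us = xfer_us + overhead_us;
	assert(total_us > 0);
	return (unsigned long)((unsigned long long)req_kib * 1000000ULL / total_us);
}
```

Assuming a 20 MiB/s bus (20480 KiB/s) and 1 ms of overhead per command, 4 KiB requests reach about 3347 KiB/s while 512 KiB requests reach about 19692 KiB/s. Under those assumed figures, a device like this depends heavily on merging even though it has no seek penalty.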
