[PATCH 1/5] Block: Discard may need to allocate pages

Matthew Wilcox

unread,

Apr 2, 2009, 10:50:13 AM4/2/09

to

From: Matthew Wilcox <mat...@wil.cx>

SCSI and ATA both need to send data to the device. In order to do this,
the BIO must be allocated with room for a page to be added, and the bio
needs to be passed to the discard prep function. We also need to free
the page attached to the BIO before we free it.

init_request_from_bio() is not currently called from a context which
forbids sleeping, and to make sure it stays that way (so we don't have
to use GFP_ATOMIC), add a might_sleep() to it.

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index f7dae57..82a3035 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -356,6 +356,8 @@ static void blkdev_discard_end_io(struct bio *bio, int err)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
}

+ if (bio_has_data(bio))
+ __free_page(bio_page(bio));
bio_put(bio);
}

@@ -387,7 +389,7 @@ int blkdev_issue_discard(struct block_device *bdev,
return -EOPNOTSUPP;

while (nr_sects && !ret) {
- bio = bio_alloc(gfp_mask, 0);
+ bio = bio_alloc(gfp_mask, 1);
if (!bio)
return -ENOMEM;

diff --git a/block/blk-core.c b/block/blk-core.c
index 996ed90..7899761 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1095,6 +1095,8 @@ EXPORT_SYMBOL(blk_put_request);

void init_request_from_bio(struct request *req, struct bio *bio)
{
+ might_sleep();
+
req->cpu = bio->bi_comp_cpu;
req->cmd_type = REQ_TYPE_FS;

@@ -1118,7 +1120,7 @@ void init_request_from_bio(struct request *req, struct bio *bio)
req->cmd_flags |= REQ_DISCARD;
if (bio_barrier(bio))
req->cmd_flags |= REQ_SOFTBARRIER;
- req->q->prepare_discard_fn(req->q, req);
+ req->q->prepare_discard_fn(req->q, req, bio);
} else if (unlikely(bio_barrier(bio)))
req->cmd_flags |= (REQ_HARDBARRIER | REQ_NOMERGE);

diff --git a/block/ioctl.c b/block/ioctl.c
index 0f22e62..088a9ba 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -145,7 +145,7 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
DECLARE_COMPLETION_ONSTACK(wait);
struct bio *bio;

- bio = bio_alloc(GFP_KERNEL, 0);
+ bio = bio_alloc(GFP_KERNEL, 1);
if (!bio)
return -ENOMEM;

@@ -170,6 +170,8 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
ret = -EOPNOTSUPP;
else if (!bio_flagged(bio, BIO_UPTODATE))
ret = -EIO;
+ if (bio_has_data(bio))
+ __free_page(bio_page(bio));
bio_put(bio);
}
return ret;
diff --git a/drivers/mtd/mtd_blkdevs.c b/drivers/mtd/mtd_blkdevs.c
index 1409f01..2b6ed4b 100644
--- a/drivers/mtd/mtd_blkdevs.c
+++ b/drivers/mtd/mtd_blkdevs.c
@@ -33,7 +33,7 @@ struct mtd_blkcore_priv {
};

static int blktrans_discard_request(struct request_queue *q,
- struct request *req)
+ struct request *req, struct bio *bio)
{
req->cmd_type = REQ_TYPE_LINUX_BLOCK;
req->cmd[0] = REQ_LB_OP_DISCARD;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 465d6ba..9d9bd7b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -260,7 +260,8 @@ typedef void (request_fn_proc) (struct request_queue *q);
typedef int (make_request_fn) (struct request_queue *q, struct bio *bio);
typedef int (prep_rq_fn) (struct request_queue *, struct request *);
typedef void (unplug_fn) (struct request_queue *);
-typedef int (prepare_discard_fn) (struct request_queue *, struct request *);
+typedef int (prepare_discard_fn) (struct request_queue *, struct request *,
+ struct bio *bio);

struct bio_vec;
struct bvec_merge_data {
--
1.6.2.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Matthew Wilcox

unread,

Apr 2, 2009, 11:10:24 AM4/2/09

to

From: David Woodhouse <dw...@infradead.org>

The commands are conceptually writes, and in the case of IDE and SCSI
commands actually are writes. They were only reads because we thought
that would interact better with the elevators. Now the elevators know
about discard requests, that advantage no longer exists.

Signed-off-by: David Woodhouse <David.W...@intel.com>

Signed-off-by: Matthew Wilcox <wi...@linux.intel.com>
---

include/linux/fs.h | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 61211ad..e5dc992 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -98,8 +98,8 @@ struct inodes_stat_t {
#define WRITE_SYNC (WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
#define SWRITE_SYNC (SWRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
#define WRITE_BARRIER (WRITE | (1 << BIO_RW_BARRIER))
-#define DISCARD_NOBARRIER (1 << BIO_RW_DISCARD)
-#define DISCARD_BARRIER ((1 << BIO_RW_DISCARD) | (1 << BIO_RW_BARRIER))
+#define DISCARD_NOBARRIER (WRITE | (1 << BIO_RW_DISCARD))
+#define DISCARD_BARRIER (DISCARD_NOBARRIER | (1 << BIO_RW_BARRIER))

#define SEL_IN 1
#define SEL_OUT 2

Matthew Wilcox

unread,

Apr 2, 2009, 11:10:26 AM4/2/09

to

This is common code shared between the IDE and libata implementations

Signed-off-by: David Woodhouse <David.W...@intel.com>
Signed-off-by: Matthew Wilcox <wi...@linux.intel.com>
---

include/linux/ata.h | 41 +++++++++++++++++++++++++++++++++++++++++
1 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/include/linux/ata.h b/include/linux/ata.h
index 6617c9f..cb79b7a 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -29,6 +29,8 @@
#ifndef __LINUX_ATA_H__
#define __LINUX_ATA_H__

+#include <linux/kernel.h>
+#include <linux/string.h>
#include <linux/types.h>
#include <asm/byteorder.h>

@@ -91,6 +93,7 @@ enum {
ATA_ID_CFA_POWER = 160,
ATA_ID_CFA_KEY_MGMT = 162,
ATA_ID_CFA_MODES = 163,
+ ATA_ID_DATA_SET_MGMT = 169,
ATA_ID_ROT_SPEED = 217,
ATA_ID_PIO4 = (1 << 1),

@@ -248,6 +251,7 @@ enum {
ATA_CMD_SMART = 0xB0,
ATA_CMD_MEDIA_LOCK = 0xDE,
ATA_CMD_MEDIA_UNLOCK = 0xDF,
+ ATA_CMD_DSM = 0x06,
/* marked obsolete in the ATA/ATAPI-7 spec */
ATA_CMD_RESTORE = 0x10,

@@ -321,6 +325,9 @@ enum {
ATA_SMART_READ_VALUES = 0xD0,
ATA_SMART_READ_THRESHOLDS = 0xD1,

+ /* feature values for Data Set Management */
+ ATA_DSM_TRIM = 0x01,
+
/* password used in LBA Mid / LBA High for executing SMART commands */
ATA_SMART_LBAM_PASS = 0x4F,
ATA_SMART_LBAH_PASS = 0xC2,
@@ -723,6 +730,14 @@ static inline int ata_id_has_unload(const u16 *id)
return 0;
}

+static inline int ata_id_has_trim(const u16 *id)
+{
+ if (ata_id_major_version(id) >= 7 &&
+ (id[ATA_ID_DATA_SET_MGMT] & 1))
+ return 1;
+ return 0;
+}
+
static inline int ata_id_current_chs_valid(const u16 *id)
{
/* For ATA-1 devices, if the INITIALIZE DEVICE PARAMETERS command
@@ -863,6 +878,32 @@ static inline void ata_id_to_hd_driveid(u16 *id)
#endif
}

+/*
+ * Write up to 'max' LBA Range Entries to the buffer that will cover the
+ * extent from sector to sector + count. This is used for TRIM and for
+ * ADD LBA(S) TO NV CACHE PINNED SET.
+ */
+static inline unsigned ata_set_lba_range_entries(void *_buffer, unsigned max,
+ u64 sector, unsigned long count)
+{
+ __le64 *buffer = _buffer;
+ unsigned i = 0;
+
+ while (i < max) {
+ u64 entry = sector |
+ ((u64)(count > 0xffff ? 0xffff : count) << 48);
+ buffer[i++] = __cpu_to_le64(entry);
+ if (count <= 0xffff)
+ break;
+ count -= 0xffff;
+ sector += 0xffff;
+ }
+
+ max = ALIGN(i * 8, 512);
+ memset(buffer + i, 0, max - i * 8);
+ return max;
+}
+
static inline int is_multi_taskfile(struct ata_taskfile *tf)
{
return (tf->command == ATA_CMD_READ_MULTI) ||

Sergei Shtylyov

unread,

Apr 2, 2009, 12:00:17 PM4/2/09

to

Matthew Wilcox wrote:

> This is common code shared between the IDE and libata implementations

> Signed-off-by: David Woodhouse <David.W...@intel.com>
> Signed-off-by: Matthew Wilcox <wi...@linux.intel.com>

[...]

I wonder where to read about the word 169 and commnand 0x06... what's it
all about?

WBR, Sergei

Matthew Wilcox

unread,

Apr 2, 2009, 12:20:06 PM4/2/09

to

On Thu, Apr 02, 2009 at 07:55:39PM +0400, Sergei Shtylyov wrote:
> I wonder where to read about the word 169 and commnand 0x06... what's
> it all about?

It's in d2015r1-ATAATAPI_Command_Set_-_2_ACS-2.pdf which can be found on
the t13 website.

Sergei Shtylyov

unread,

Apr 2, 2009, 12:40:12 PM4/2/09

to

Matthew Wilcox wrote:

>> I wonder where to read about the word 169 and commnand 0x06... what's
>>it all about?

> It's in d2015r1-ATAATAPI_Command_Set_-_2_ACS-2.pdf which can be found on
> the t13 website.

But since neither ATA/PI-7 nor ATA/PI-8 mention neither the word nor
command, why your validity check requires only ATA/PI-7+ compliance?

MBR, Sergei

Matthew Wilcox

unread,

Apr 2, 2009, 1:00:21 PM4/2/09

to

On Thu, Apr 02, 2009 at 08:32:56PM +0400, Sergei Shtylyov wrote:
> Matthew Wilcox wrote:
>
>>> I wonder where to read about the word 169 and commnand 0x06...
>>> what's it all about?
>
>> It's in d2015r1-ATAATAPI_Command_Set_-_2_ACS-2.pdf which can be found on
>> the t13 website.
>
> But since neither ATA/PI-7 nor ATA/PI-8 mention neither the word nor
> command, why your validity check requires only ATA/PI-7+ compliance?

Because there are ATA-7 devices which implement this feature.

Boaz Harrosh

unread,

Apr 5, 2009, 8:40:06 AM4/5/09

to

On 04/02/2009 05:37 PM, Matthew Wilcox wrote:
> From: Matthew Wilcox <mat...@wil.cx>
>
> SCSI and ATA both need to send data to the device. In order to do this,
> the BIO must be allocated with room for a page to be added, and the bio
> needs to be passed to the discard prep function. We also need to free
> the page attached to the BIO before we free it.
>
> init_request_from_bio() is not currently called from a context which
> forbids sleeping, and to make sure it stays that way (so we don't have
> to use GFP_ATOMIC), add a might_sleep() to it.
>

I understand you have inherited this code, but I think it is a bit of a mess
and you are only adding to the it.

> Signed-off-by: Matthew Wilcox <wi...@linux.intel.com>
> ---
> block/blk-barrier.c | 4 +++-
> block/blk-core.c | 4 +++-
> block/ioctl.c | 4 +++-
> drivers/mtd/mtd_blkdevs.c | 2 +-
> include/linux/blkdev.h | 3 ++-
> 5 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/block/blk-barrier.c b/block/blk-barrier.c
> index f7dae57..82a3035 100644
> --- a/block/blk-barrier.c
> +++ b/block/blk-barrier.c
> @@ -356,6 +356,8 @@ static void blkdev_discard_end_io(struct bio *bio, int err)
> clear_bit(BIO_UPTODATE, &bio->bi_flags);
> }
>
> + if (bio_has_data(bio))
> + __free_page(bio_page(bio));

Page freed which was allocated by the LLD

> bio_put(bio);

OK bio was allocated by user code but shouldn't

> }
>
> @@ -387,7 +389,7 @@ int blkdev_issue_discard(struct block_device *bdev,
> return -EOPNOTSUPP;
>
> while (nr_sects && !ret) {
> - bio = bio_alloc(gfp_mask, 0);
> + bio = bio_alloc(gfp_mask, 1);

blkdev_issue_discard() and blk_ioctl_discard() has half a page
of common (and changing) code, could be done to use a common
helper that sets policy about bio allocation sizes and such.

Just my $0.017

> if (!bio)
> return -ENOMEM;
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 996ed90..7899761 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1095,6 +1095,8 @@ EXPORT_SYMBOL(blk_put_request);
>
> void init_request_from_bio(struct request *req, struct bio *bio)
> {
> + might_sleep();
> +
> req->cpu = bio->bi_comp_cpu;
> req->cmd_type = REQ_TYPE_FS;
>
> @@ -1118,7 +1120,7 @@ void init_request_from_bio(struct request *req, struct bio *bio)
> req->cmd_flags |= REQ_DISCARD;
> if (bio_barrier(bio))
> req->cmd_flags |= REQ_SOFTBARRIER;
> - req->q->prepare_discard_fn(req->q, req);
> + req->q->prepare_discard_fn(req->q, req, bio);

Allocation of bio page could be done commonly here.
The prepare_discard_fn() is made to return the needed size. It is not as if we actually
give the driver a choice about the allocation.

So now we allocate the page and free it at the same level.
And we do it only in one place.

Same common code in [PATCH 4/5] and [PATCH 4/5] is done once, here.

> } else if (unlikely(bio_barrier(bio)))
> req->cmd_flags |= (REQ_HARDBARRIER | REQ_NOMERGE);
>
> diff --git a/block/ioctl.c b/block/ioctl.c
> index 0f22e62..088a9ba 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -145,7 +145,7 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
> DECLARE_COMPLETION_ONSTACK(wait);
> struct bio *bio;
>
> - bio = bio_alloc(GFP_KERNEL, 0);
> + bio = bio_alloc(GFP_KERNEL, 1);

This is deja vu, don't you think ;)

I have one question:

At [PATCH 4/5] and [PATCH 4/5] you do:
+ struct page *page = alloc_page(GFP_KERNEL);

does that zero the alloced page? since if I understand correctly this page
will go on the wire, a SW target on the other size could snoop random Kernel
memory, is that allowed? OK I might be totally clueless here.

Have a good day
Boaz

Matthew Wilcox

unread,

Apr 6, 2009, 4:40:13 PM4/6/09

to

On Sun, Apr 05, 2009 at 03:28:07PM +0300, Boaz Harrosh wrote:
> > + if (bio_has_data(bio))
> > + __free_page(bio_page(bio));
>
> Page freed which was allocated by the LLD

Not by the LLD. By the ULD.

> blkdev_issue_discard() and blk_ioctl_discard() has half a page
> of common (and changing) code, could be done to use a common
> helper that sets policy about bio allocation sizes and such.

Sure, could be done.

> > - req->q->prepare_discard_fn(req->q, req);
> > + req->q->prepare_discard_fn(req->q, req, bio);
>
> Allocation of bio page could be done commonly here.

Not all prepare_discard_fn() implementations need or want a bio page.
The CFA ERASE SECTORS command is a non-data command, for example.

> The prepare_discard_fn() is made to return the needed size. It is not as if we actually
> give the driver a choice about the allocation.

Umm ... prepare_discard_fn() needs to fill in the page. I don't
understand what code you propose here.

> > - bio = bio_alloc(GFP_KERNEL, 0);
> > + bio = bio_alloc(GFP_KERNEL, 1);
>
> This is deja vu, don't you think ;)

Yep.

> I have one question:
>
> At [PATCH 4/5] and [PATCH 4/5] you do:
> + struct page *page = alloc_page(GFP_KERNEL);
>
> does that zero the alloced page? since if I understand correctly this page
> will go on the wire, a SW target on the other size could snoop random Kernel
> memory, is that allowed? OK I might be totally clueless here.

The prepare_discard_fn() is responsible for not leaking data. The ATA
one does it via memset() to a 512 byte boundary. The SCSI one
initialises the 24 bytes that it sends on the wire. I don't think
either implementation leaks uninitialised data.

Jeff Garzik

unread,

Apr 6, 2009, 8:10:12 PM4/6/09

to

Matthew Wilcox wrote:
> This is common code shared between the IDE and libata implementations
>
> Signed-off-by: David Woodhouse <David.W...@intel.com>
> Signed-off-by: Matthew Wilcox <wi...@linux.intel.com>
> ---
> include/linux/ata.h | 41 +++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 41 insertions(+), 0 deletions(-)

applied

Mark Lord

unread,

Apr 17, 2009, 5:30:14 PM4/17/09

to

Matthew Wilcox wrote:
> From: Matthew Wilcox <mat...@wil.cx>
>
> SCSI and ATA both need to send data to the device. In order to do this,
> the BIO must be allocated with room for a page to be added, and the bio
> needs to be passed to the discard prep function. We also need to free
> the page attached to the BIO before we free it.
>
> init_request_from_bio() is not currently called from a context which
> forbids sleeping, and to make sure it stays that way (so we don't have
> to use GFP_ATOMIC), add a might_sleep() to it.
>
> Signed-off-by: Matthew Wilcox <wi...@linux.intel.com>

...

> @@ -1118,7 +1120,7 @@ void init_request_from_bio(struct request *req, struct bio *bio)
> req->cmd_flags |= REQ_DISCARD;
> if (bio_barrier(bio))
> req->cmd_flags |= REQ_SOFTBARRIER;
> - req->q->prepare_discard_fn(req->q, req);
> + req->q->prepare_discard_fn(req->q, req, bio);
> } else if (unlikely(bio_barrier(bio)))
> req->cmd_flags |= (REQ_HARDBARRIER | REQ_NOMERGE);
>

..

Matthew, note that prepare_discardfn() may fail,
and return an error result. The above code does not
check for this or handle it very well.

Cheers

Matthew Wilcox

unread,

May 3, 2009, 2:20:07 AM5/3/09

to

[I thought I replied to this, but I don't see an indication that I did]

On Sun, Apr 05, 2009 at 03:28:07PM +0300, Boaz Harrosh wrote:

> > +++ b/block/blk-barrier.c
> > @@ -356,6 +356,8 @@ static void blkdev_discard_end_io(struct bio *bio, int err)
> > clear_bit(BIO_UPTODATE, &bio->bi_flags);
> > }
> >
> > + if (bio_has_data(bio))
> > + __free_page(bio_page(bio));
>
> Page freed which was allocated by the LLD

It wasn't allocated by the LLD. It was allocated by the ULD.

> > bio_put(bio);
>
> OK bio was allocated by user code but shouldn't

? Are you saying the bio should be allocated by each driver
implementing a discard operation?

> > while (nr_sects && !ret) {
> > - bio = bio_alloc(gfp_mask, 0);
> > + bio = bio_alloc(gfp_mask, 1);
>
> blkdev_issue_discard() and blk_ioctl_discard() has half a page
> of common (and changing) code, could be done to use a common
> helper that sets policy about bio allocation sizes and such.
>
> Just my $0.017

Yes, that works nicely. Thanks for the suggestion.

> > @@ -1118,7 +1120,7 @@ void init_request_from_bio(struct request *req, struct bio *bio)
> > req->cmd_flags |= REQ_DISCARD;
> > if (bio_barrier(bio))
> > req->cmd_flags |= REQ_SOFTBARRIER;
> > - req->q->prepare_discard_fn(req->q, req);
> > + req->q->prepare_discard_fn(req->q, req, bio);
>
> Allocation of bio page could be done commonly here.
> The prepare_discard_fn() is made to return the needed size. It is not as if we actually
> give the driver a choice about the allocation.

Not all drivers need to allocate a page. Some drivers may need
to allocate more than one page, depending on how large the range is.
And the driver can't just return the page size it needs here -- it needs
to fill in the contents of the page too.

I suppose we could do something fairly disgusting like:

for (;;) {
struct page *page;
needed = req->q->prepare_discard_fn(req->q, req, bio);
if (!needed)
break;
page = alloc_page(GFP_KERNEL);
if (bio_add_pc_page(q, bio, page, PAGE_SIZE, 0) < PAGE_SIZE)
goto fail;
}

Then the driver can return 0 => success, anything else => allocate more
ram, try again.

> I have one question:
>
> At [PATCH 4/5] and [PATCH 4/5] you do:
> + struct page *page = alloc_page(GFP_KERNEL);
>
> does that zero the alloced page? since if I understand correctly this page
> will go on the wire, a SW target on the other size could snoop random Kernel
> memory, is that allowed? OK I might be totally clueless here.

alloc_page doesn't zero the page.

scsi only sends out 24 bytes of that page on the wire, and it initialises
all 24 bytes. ide/ata send multiples of 512 bytes on the wire, and
they're careful to zero any of the space they're not using.

Matthew Wilcox

unread,

May 3, 2009, 3:20:06 AM5/3/09

to

On Sat, May 02, 2009 at 11:11:50PM -0700, Matthew Wilcox wrote:
> > blkdev_issue_discard() and blk_ioctl_discard() has half a page
> > of common (and changing) code, could be done to use a common
> > helper that sets policy about bio allocation sizes and such.
> >
> > Just my $0.017
>
> Yes, that works nicely. Thanks for the suggestion.

I've pushed out a new git tree:

http://git.kernel.org/?p=linux/kernel/git/willy/ssd.git;a=shortlog;h=trim-20090502

Changes since the last version:

- Based on 2.6.30-rc4.
- Dropped the three patches which were already merged.
- Shuffled 'Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of
reads' to the front of the queue since it's not controversial and could
be merged earlier.
- Inserted 'Unify blk_ioctl_discard and blkdev_issue_discard' as patch two.
- Updated 'ide: Add support for TRIM' to compile with new IDE code.

Still to do:
- Understand all the changes Bart made to the IDE code; I didn't make all
the changes that he did. I just made it compile for now.
- Dave had some objections to the description of 'Make DISCARD_BARRIER and
DISCARD_NOBARRIER writes instead of reads', but didn't propose replacement
text.
- Figure out what to do about memory allocation.
- Handle error returns from discard_prep.
- Test the new code still works.
- Delve into the latest SCSI spec and see if anything changed in UNMAP

Given the extensive length of the todo list, I shan't send out the mailbomb.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Hugh Dickins

unread,

May 3, 2009, 9:10:11 AM5/3/09

to

On Sun, 3 May 2009, Matthew Wilcox wrote:
> On Sat, May 02, 2009 at 11:11:50PM -0700, Matthew Wilcox wrote:
> > > blkdev_issue_discard() and blk_ioctl_discard() has half a page
> > > of common (and changing) code, could be done to use a common
> > > helper that sets policy about bio allocation sizes and such.
> > >
> > > Just my $0.017
> >
> > Yes, that works nicely. Thanks for the suggestion.
>
> I've pushed out a new git tree:
>
> http://git.kernel.org/?p=linux/kernel/git/willy/ssd.git;a=shortlog;h=trim-20090502

I'm glad to see this coming alive again, thank you.

I notice OCZ hope to have updated Vertex firmware mid-May,
such that their TRIM support matches up with (y)ours.

One comment below...

>
> Changes since the last version:
>
> - Based on 2.6.30-rc4.
> - Dropped the three patches which were already merged.
> - Shuffled 'Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of
> reads' to the front of the queue since it's not controversial and could
> be merged earlier.
> - Inserted 'Unify blk_ioctl_discard and blkdev_issue_discard' as patch two.
> - Updated 'ide: Add support for TRIM' to compile with new IDE code.
>
> Still to do:
> - Understand all the changes Bart made to the IDE code; I didn't make all
> the changes that he did. I just made it compile for now.
> - Dave had some objections to the description of 'Make DISCARD_BARRIER and
> DISCARD_NOBARRIER writes instead of reads', but didn't propose replacement
> text.
> - Figure out what to do about memory allocation.

You have GFP_KERNEL allocations in your discard_fn()s. I believe that
allocations down at that level will need to be GFP_NOIO - the bio mempool
might be empty by this point, you don't want memory allocation to get into
submitting even more I/O to satisfy your discard_fn(). Does it need its
own mempool of reserve pages? My guess is that you can get away without
that, since a failed discard, though regrettable, loses no data.

Hugh

> - Handle error returns from discard_prep.
> - Test the new code still works.
> - Delve into the latest SCSI spec and see if anything changed in UNMAP
>
> Given the extensive length of the todo list, I shan't send out the mailbomb.
--

Matthew Wilcox

unread,

May 3, 2009, 10:50:06 AM5/3/09

to

On Sun, May 03, 2009 at 02:07:32PM +0100, Hugh Dickins wrote:
> I notice OCZ hope to have updated Vertex firmware mid-May,
> such that their TRIM support matches up with (y)ours.

Excellent.

> You have GFP_KERNEL allocations in your discard_fn()s. I believe that
> allocations down at that level will need to be GFP_NOIO - the bio mempool
> might be empty by this point, you don't want memory allocation to get into
> submitting even more I/O to satisfy your discard_fn(). Does it need its
> own mempool of reserve pages? My guess is that you can get away without
> that, since a failed discard, though regrettable, loses no data.

You make a good point. There's definitely still work to be done around
error handling. OTOH, we could always do what IDE does ...

static void idedisk_prepare_flush(struct request_queue *q, struct request *rq)
{
ide_drive_t *drive = q->queuedata;
struct ide_cmd *cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);

/* FIXME: map struct ide_taskfile on rq->cmd[] */
BUG_ON(cmd == NULL);

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Boaz Harrosh

unread,

May 3, 2009, 11:10:05 AM5/3/09

to

On 05/03/2009 05:48 PM, Matthew Wilcox wrote:
> On Sun, May 03, 2009 at 02:07:32PM +0100, Hugh Dickins wrote:
>> I notice OCZ hope to have updated Vertex firmware mid-May,
>> such that their TRIM support matches up with (y)ours.
>
> Excellent.
>
>> You have GFP_KERNEL allocations in your discard_fn()s. I believe that
>> allocations down at that level will need to be GFP_NOIO - the bio mempool
>> might be empty by this point, you don't want memory allocation to get into
>> submitting even more I/O to satisfy your discard_fn(). Does it need its
>> own mempool of reserve pages? My guess is that you can get away without
>> that, since a failed discard, though regrettable, loses no data.
>
> You make a good point. There's definitely still work to be done around
> error handling. OTOH, we could always do what IDE does ...
>
> static void idedisk_prepare_flush(struct request_queue *q, struct request *rq)
> {
> ide_drive_t *drive = q->queuedata;
> struct ide_cmd *cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
>
> /* FIXME: map struct ide_taskfile on rq->cmd[] */
> BUG_ON(cmd == NULL);
>
>
>

Out from the frying pan and into the fire. ;-)

I agree with Hugh. The allocation is done at, too-low in the food chain.
(And that free of buffer at upper layer allocated by lower layer).

I think you need to separate the: "does lld need buffer, what size"
from the "here is buffer prepare", so upper layer that can sleep does
sleep.

In all other buffer needing operations the allocation is done before
submission of request, No?

Just my $0.017
Thanks
Boaz

BTW:
I've not have time to review second round will "TODO"

Hugh Dickins

unread,

May 3, 2009, 11:10:06 AM5/3/09

to

On Sun, 3 May 2009, Matthew Wilcox wrote:
>

> You make a good point. There's definitely still work to be done around
> error handling. OTOH, we could always do what IDE does ...
>
> static void idedisk_prepare_flush(struct request_queue *q, struct request *rq)
> {
> ide_drive_t *drive = q->queuedata;
> struct ide_cmd *cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
>
> /* FIXME: map struct ide_taskfile on rq->cmd[] */
> BUG_ON(cmd == NULL);

Oooh, you're winding me up. Certainly not. Off with the Top Of your Head!

Hugh

Matthew Wilcox

unread,

May 3, 2009, 11:50:09 AM5/3/09

to

On Sun, May 03, 2009 at 06:02:51PM +0300, Boaz Harrosh wrote:
> I agree with Hugh. The allocation is done at, too-low in the food chain.
> (And that free of buffer at upper layer allocated by lower layer).
>
> I think you need to separate the: "does lld need buffer, what size"
> from the "here is buffer prepare", so upper layer that can sleep does
> sleep.

So you want two function pointers in the request queue relating to discard?

> In all other buffer needing operations the allocation is done before
> submission of request, No?

It's not true for the flush request (the example I quoted). Obviously,
the solution adopted here by IDE is Bad and Wrong ...

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Boaz Harrosh

unread,

May 3, 2009, 12:40:07 PM5/3/09

to

On 05/03/2009 06:42 PM, Matthew Wilcox wrote:
> On Sun, May 03, 2009 at 06:02:51PM +0300, Boaz Harrosh wrote:
>> I agree with Hugh. The allocation is done at, too-low in the food chain.
>> (And that free of buffer at upper layer allocated by lower layer).
>>
>> I think you need to separate the: "does lld need buffer, what size"
>> from the "here is buffer prepare", so upper layer that can sleep does
>> sleep.
>
> So you want two function pointers in the request queue relating to discard?
>

OK I don't know what I want, I guess. ;-)

I'm not a block-device export but from the small osdblk device I maintain
it looks like osdblk_prepare_flush which is set into:
blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, osdblk_prepare_flush);

does some internal structure setup, but the actual flush command is only executed
later in the global osdblk_rq_fn which is set into:
blk_init_queue(osdblk_rq_fn, &osdev->lock);

But I'm not even sure that prepare_flush is called in a better context then
queue_fn, and what does it means to let block devices take care of another
new command type at queue_fn.

I guess it comes back to Jeff Garzik's comment about not having a central
place to ask the request what we need to do.

But I do hate that allocation is done by driver and free by mid-layer,
so yes two vectors, request_queue is allocated once per device it's not
that bad. And later when Jeff's comment is addressed it can be removed.

>> In all other buffer needing operations the allocation is done before
>> submission of request, No?
>
> It's not true for the flush request (the example I quoted). Obviously,
> the solution adopted here by IDE is Bad and Wrong ...
>

So now I don't understand, is flush executed by the queue_fn or by the
prepare_flush, or am I seeing two type of flushes?

Thanks
Boaz

Jeff Garzik

unread,

May 3, 2009, 2:40:07 PM5/3/09

to

Boaz Harrosh wrote:
> On 05/03/2009 06:42 PM, Matthew Wilcox wrote:
>> On Sun, May 03, 2009 at 06:02:51PM +0300, Boaz Harrosh wrote:
>>> I agree with Hugh. The allocation is done at, too-low in the food chain.
>>> (And that free of buffer at upper layer allocated by lower layer).
>>>
>>> I think you need to separate the: "does lld need buffer, what size"
>>> from the "here is buffer prepare", so upper layer that can sleep does
>>> sleep.
>> So you want two function pointers in the request queue relating to discard?
>>
>
> OK I don't know what I want, I guess. ;-)
>
> I'm not a block-device export but from the small osdblk device I maintain
> it looks like osdblk_prepare_flush which is set into:
> blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, osdblk_prepare_flush);
>
> does some internal structure setup, but the actual flush command is only executed
> later in the global osdblk_rq_fn which is set into:
> blk_init_queue(osdblk_rq_fn, &osdev->lock);
>
> But I'm not even sure that prepare_flush is called in a better context then
> queue_fn, and what does it means to let block devices take care of another
> new command type at queue_fn.
>
> I guess it comes back to Jeff Garzik's comment about not having a central
> place to ask the request what we need to do.
>
> But I do hate that allocation is done by driver and free by mid-layer,
> so yes two vectors, request_queue is allocated once per device it's not
> that bad. And later when Jeff's comment is addressed it can be removed.

May I presume you are referring to the following osdblk.c comment?

/* deduce our operation (read, write, flush) */
/* I wish the block layer simplified
* cmd_type/cmd_flags/cmd[]
* into a clearly defined set of RPC commands:
* read, write, flush, scsi command, power mgmt req,
* driver-specific, etc.
*/

Yes, the task of figuring out -what to do- in the queue's request
function is quite complex, and discard makes it even more so.

The API makes life difficult -- you have to pass temporary info to
yourself in ->prepare_flush_fn() and ->prepare_discard_fn(), and the
overall sum is a bewildering collection of opcodes, flags, and internal
driver notes to itself.

Add to this yet another prep function, ->prep_rq_fn()

It definitely sucks, especially with regards to failed atomic
allocations... but I think fixing this quite a big more than Matthew
probably willing to tackle ;-)

My ideal block layer interface would be a lot more opcode-based, e.g.

(1) create REQ_TYPE_DISCARD

(2) determine at init if queue (a) supports explicit DISCARD and/or (b)
supports DISCARD flag passed with READ or WRITE

(3) when creating a discard request, use block helpers w/ queue-specific
knowledge to create either
(a) one request, REQ_TYPE_FS, with discard flag or
(b) two requests, REQ_TYPE_FS followed by REQ_TYPE_DISCARD

(4) blkdev_issue_discard() would function like an empty barrier, and
unconditionally create REQ_TYPE_DISCARD.

This type of setup would require NO prepare_discard command, as all
knowledge would be passed directly to ->prep_rq_fn() and ->request_fn()

And to tangent a bit... I feel barriers should be handled in exactly
the same way. Create REQ_TYPE_FLUSH, which would be issued for above
examples #2a and #4, if the queue is setup that way.

All this MINIMIZES the amount of information a driver must "pass to
itself", by utilizing existing ->prep_fn_rq() and ->request_fn() pathways.

Jeff

Bartlomiej Zolnierkiewicz

unread,

May 3, 2009, 2:50:10 PM5/3/09

to

On Sunday 03 May 2009 17:42:16 Matthew Wilcox wrote:
> On Sun, May 03, 2009 at 06:02:51PM +0300, Boaz Harrosh wrote:
> > I agree with Hugh. The allocation is done at, too-low in the food chain.
> > (And that free of buffer at upper layer allocated by lower layer).
> >
> > I think you need to separate the: "does lld need buffer, what size"
> > from the "here is buffer prepare", so upper layer that can sleep does
> > sleep.
>
> So you want two function pointers in the request queue relating to discard?
>
> > In all other buffer needing operations the allocation is done before
> > submission of request, No?
>
> It's not true for the flush request (the example I quoted). Obviously,
> the solution adopted here by IDE is Bad and Wrong ...

When speaking about "the solution" do you mean that quick & ugly hack
that I needed to fix the unexpected regression during late -rc cycle? ;)

The proper solution would involve adding 'struct ide_cmd flush_cmd' to
ide_drive_t but I never got to verify/inquiry that is OK with the block
layer (if there can be only one flush rq outstanding or if we can have
both q->pre_flush_rq and q->post_flush_rq queued at the same time)...

Thanks,
Bart

Jeff Garzik

unread,

May 3, 2009, 2:50:09 PM5/3/09

to

Jeff Garzik wrote:
> (2) determine at init if queue (a) supports explicit DISCARD and/or (b)
> supports DISCARD flag passed with READ or WRITE

As an aside -- does any existing command set support case #b, above?

AFAICT, ATA, SCSI and NVMHCI all have a single, explicit hardware
command to discard/deallocate unused sectors.

Therefore, creating REQ_TYPE_DISCARD seems to eliminate any need for new
hook ->prepare_discard().

This provides a 1:1 correspondence between hardware and struct request,
most closely matching the setup of known hardware.

James Bottomley

unread,

May 3, 2009, 3:10:15 PM5/3/09

to

On Sun, 2009-05-03 at 14:40 -0400, Jeff Garzik wrote:
> Jeff Garzik wrote:
> > (2) determine at init if queue (a) supports explicit DISCARD and/or (b)
> > supports DISCARD flag passed with READ or WRITE
>
>
> As an aside -- does any existing command set support case #b, above?

Not to my knowledge.

I think discard was modelled on barrier (which can be associated with
data or stand on its own).

> AFAICT, ATA, SCSI and NVMHCI all have a single, explicit hardware
> command to discard/deallocate unused sectors.
>
> Therefore, creating REQ_TYPE_DISCARD seems to eliminate any need for new
> hook ->prepare_discard().

Well, yes and no ... in the SCSI implementation of either UNMAP or WRITE
SAME, we still need a data buffer to store either the extents or the
actual same data.

> This provides a 1:1 correspondence between hardware and struct request,
> most closely matching the setup of known hardware.

Agreed ... but we still have to allocate the adjunct buffer
somewhere ....

James

Jeff Garzik

unread,

May 3, 2009, 3:30:16 PM5/3/09

to

James Bottomley wrote:
> On Sun, 2009-05-03 at 14:40 -0400, Jeff Garzik wrote:
>> Jeff Garzik wrote:
>>> (2) determine at init if queue (a) supports explicit DISCARD and/or (b)
>>> supports DISCARD flag passed with READ or WRITE
>>
>> As an aside -- does any existing command set support case #b, above?
>
> Not to my knowledge.
>
> I think discard was modelled on barrier (which can be associated with
> data or stand on its own).

Yeah -- I am thinking we can reduce complexity if no real hardware
supports "associated with data."

>> AFAICT, ATA, SCSI and NVMHCI all have a single, explicit hardware
>> command to discard/deallocate unused sectors.
>>
>> Therefore, creating REQ_TYPE_DISCARD seems to eliminate any need for new
>> hook ->prepare_discard().
>
> Well, yes and no ... in the SCSI implementation of either UNMAP or WRITE
> SAME, we still need a data buffer to store either the extents or the
> actual same data.

Sure, UNMAP's extents would quite naturally travel with the
REQ_TYPE_DISCARD struct request, and can be set up at struct request
allocation time. In all of ATA, SCSI and NVMHCI, they appear to want
the extents.

Is WRITE SAME associated with this current DISCARD work, or is that just
a similar example? I'm unfamiliar with its issues...

>> This provides a 1:1 correspondence between hardware and struct request,
>> most closely matching the setup of known hardware.
>
> Agreed ... but we still have to allocate the adjunct buffer
> somewhere ....

Agreed. I merely argue ->prep_rq_fn() would be a better place -- it can
at least return a useful return value -- than ->prepare_discard(), which
would be the natural disposition for a REQ_TYPE_DISCARD -- or
->request_fn() itself, if a block driver so chooses.

The extents would be an argument to REQ_TYPE_DISCARD, and should be
associated with struct request somehow, by struct request's creator.
The extent info needs to be in some generic form, because all of
ATA|SCSI|NVMHCI seem to want it.

[tangent...]

Does make you wonder if a ->init_rq_fn() would be helpful, one that
could perform gfp_t allocations rather than GFP_ATOMIC? The idea being
to call ->init_rq_fn() almost immediately after creation of struct
request, by the struct request creator.

I obviously have not thought in depth about this, but it does seem that
init_rq_fn(), called earlier in struct request lifetime, could eliminate
the need for ->prepare_flush, ->prepare_discard, and perhaps could be a
better place for some of the ->prep_rq_fn logic.

The creator of struct request generally has more freedom to sleep, and
it seems logical to give low-level drivers a "fill in LLD-specific info"
hook BEFORE the request is ever added to a request_queue.

Who knows...

Jeff

James Bottomley

unread,

May 3, 2009, 3:40:06 PM5/3/09

to

On Sun, 2009-05-03 at 15:20 -0400, Jeff Garzik wrote:
> Is WRITE SAME associated with this current DISCARD work, or is that just
> a similar example? I'm unfamiliar with its issues...

It's an adjunct body of work. T10 apparently ratified both UNMAP and
the WRITE SAME extensions. What WRITE SAME does is write the same data
block to multiple contiguous locations as specified in the CDB. What
the thin provisioning update did for it is allow you to specify a flag
saying I want these sectors unmapped. The perceived benefit of WRITE
SAME is that you specify (with the write same data ... presumably all
zeros) what an unmapped sector will return if it's ever read from again,
which was a big argument in the UNMAP case.

James

James Bottomley

unread,

May 3, 2009, 3:50:11 PM5/3/09

to

On Sun, 2009-05-03 at 15:20 -0400, Jeff Garzik wrote:

> [tangent...]
>
> Does make you wonder if a ->init_rq_fn() would be helpful, one that
> could perform gfp_t allocations rather than GFP_ATOMIC? The idea being
> to call ->init_rq_fn() almost immediately after creation of struct
> request, by the struct request creator.

Isn't that what the current prep_fn actually is?

> I obviously have not thought in depth about this, but it does seem that
> init_rq_fn(), called earlier in struct request lifetime, could eliminate
> the need for ->prepare_flush, ->prepare_discard, and perhaps could be a
> better place for some of the ->prep_rq_fn logic.

It's hard to see how ... prep_rq_fn is already called pretty early ...
almost as soon as the elevator has decided to spit out the request

> The creator of struct request generally has more freedom to sleep, and
> it seems logical to give low-level drivers a "fill in LLD-specific info"
> hook BEFORE the request is ever added to a request_queue.

Unfortunately it's not really possible to find a sleeping context in
there: The elevators have to operate from the current
elv_next_request() context, which, in most drivers can either be user or
interrupt.

The way the block layer is designed is to pull allocations up the stack
much closer to the process (usually at the bio creation point) because
that allows the elevators to operate even in memory starved conditions.
If we pushed the allocation down into the request level, we'd need some
type of threading (bad for performance) and the request processing would
stall when some GFP_KERNEL allocation went out to lunch finding memory.

The ideal for REQ_TYPE_DISCARD seems to be to force a page allocation
tied to a bio when it's issued at the top. That way everyone has enough
memory when it comes down the stack (both extents and WRITE SAME sector
will fit into a page ... although only just for WRITE SAME on 4k
sectors).

James

Matthew Wilcox

unread,

May 3, 2009, 5:50:06 PM5/3/09

to

On Sun, May 03, 2009 at 02:34:35PM -0400, Jeff Garzik wrote:
> Yes, the task of figuring out -what to do- in the queue's request
> function is quite complex, and discard makes it even more so.
>
> The API makes life difficult -- you have to pass temporary info to
> yourself in ->prepare_flush_fn() and ->prepare_discard_fn(), and the
> overall sum is a bewildering collection of opcodes, flags, and internal
> driver notes to itself.
>
> Add to this yet another prep function, ->prep_rq_fn()
>
> It definitely sucks, especially with regards to failed atomic
> allocations... but I think fixing this quite a big more than Matthew
> probably willing to tackle ;-)

I'm completely confused by the block layer API, to be honest. Trying to
deduce how to add a new feature at this stage is hard (compare it to
adding the reflink operation to the VFS ...). I'm definitely willing to
tackle changing the block device API, but it may take a while.

> My ideal block layer interface would be a lot more opcode-based, e.g.
>
> (1) create REQ_TYPE_DISCARD
>
> (2) determine at init if queue (a) supports explicit DISCARD and/or (b)
> supports DISCARD flag passed with READ or WRITE
>
> (3) when creating a discard request, use block helpers w/ queue-specific
> knowledge to create either
> (a) one request, REQ_TYPE_FS, with discard flag or
> (b) two requests, REQ_TYPE_FS followed by REQ_TYPE_DISCARD

I'm not sure we need option 3b.

> (4) blkdev_issue_discard() would function like an empty barrier, and
> unconditionally create REQ_TYPE_DISCARD.

I can certainly prototype a replacement for discard_prep_fn along these lines.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Jeff Garzik

unread,

May 3, 2009, 6:50:10 PM5/3/09

to

James Bottomley wrote:
> On Sun, 2009-05-03 at 15:20 -0400, Jeff Garzik wrote:
>> [tangent...]
>>
>> Does make you wonder if a ->init_rq_fn() would be helpful, one that
>> could perform gfp_t allocations rather than GFP_ATOMIC? The idea being
>> to call ->init_rq_fn() almost immediately after creation of struct
>> request, by the struct request creator.
>
> Isn't that what the current prep_fn actually is?

> It's hard to see how ... prep_rq_fn is already called pretty early ...

> almost as soon as the elevator has decided to spit out the request

prep_rq_fn is called very, very late -- from elv_next_request(), which
is called from ->request_fn.

As quoted above, I'm talking about something closer to -creation- time,
not -execution- time.

>> The creator of struct request generally has more freedom to sleep, and
>> it seems logical to give low-level drivers a "fill in LLD-specific info"
>> hook BEFORE the request is ever added to a request_queue.
>
> Unfortunately it's not really possible to find a sleeping context in
> there: The elevators have to operate from the current
> elv_next_request() context, which, in most drivers can either be user or
> interrupt.

Sure, because that's further down the pipeline at the request execution
stage. Think more like make_request time...

> The ideal for REQ_TYPE_DISCARD seems to be to force a page allocation
> tied to a bio when it's issued at the top. That way everyone has enough
> memory when it comes down the stack (both extents and WRITE SAME sector
> will fit into a page ... although only just for WRITE SAME on 4k
> sectors).

Makes sense...

Jeff

Jeff Garzik

unread,

May 3, 2009, 7:00:13 PM5/3/09

to

Matthew Wilcox wrote:
>> (3) when creating a discard request, use block helpers w/ queue-specific
>> knowledge to create either
>> (a) one request, REQ_TYPE_FS, with discard flag or
>> (b) two requests, REQ_TYPE_FS followed by REQ_TYPE_DISCARD
>
> I'm not sure we need option 3b.

Well -- it is a hard requirement to map 1:1 struct request with the
underlying hardware device's command set.

If a device command set lacks a READ-and-DISCARD operation, then you
_must_ create two struct request. Otherwise you break block layer
tagging and other 1:1-based assumptions that exist today.

Thus, given that all of ATA|SCSI|NVMHCI have a DISCARD operation
distinct from other commands...

option 3b is the overwhelming common case.

Jeff

Douglas Gilbert

unread,

May 4, 2009, 10:10:13 AM5/4/09

to

James Bottomley wrote:
> On Sun, 2009-05-03 at 15:20 -0400, Jeff Garzik wrote:
>> Is WRITE SAME associated with this current DISCARD work, or is that just
>> a similar example? I'm unfamiliar with its issues...
>
> It's an adjunct body of work. T10 apparently ratified both UNMAP and
> the WRITE SAME extensions. What WRITE SAME does is write the same data
> block to multiple contiguous locations as specified in the CDB. What
> the thin provisioning update did for it is allow you to specify a flag
> saying I want these sectors unmapped. The perceived benefit of WRITE
> SAME is that you specify (with the write same data ... presumably all
> zeros) what an unmapped sector will return if it's ever read from again,
> which was a big argument in the UNMAP case.

James,
Your presumption is correct. For the UNMAP bit to be honoured
in the SCSI WRITE SAME command, the user data part of the
data-out buffer needs to be all zeros, and, if present,
the protection data part of the data-out buffer needs
to be all 0xff_s (i.e. 8 bytes of 0xff). Otherwise the
UNMAP bit in WRITE SAME command is ignored and it does a
"normal" WRITE SAME.

My $0.02's worth was a suggestion to report an error if the
UNMAP bit was given to WRITE SAME and the data-out
buffer did not comply with the above pattern. Alternatively
the data-out buffer could just be ignored. The author
of the WRITE SAME "unmap" facility duly noted my observations
and rejected them :-) The wording in sbc3r18.pdf for WRITE SAME
is contorted so there will be changes. And t10 is still
having teleconferences about thin provisioning so there may be
non-trivial changes in the near future.

Doug Gilbert

James Bottomley

unread,

May 4, 2009, 10:50:17 AM5/4/09

to

Actually, I'd just like something far more basic: forcing a thin
provisioned array to support all of the three possible mechanisms. It's
going to be a real mess trying to work out for any given array do you
support UNMAP or WRITE SAME(16) or WRITE SAME(32)? We can only do this
currently by trying the commands ... then we have to have support for
all three built into sd just in case ... and we get precisely the same
functionality in each case ...

James

Douglas Gilbert

unread,

May 4, 2009, 11:20:16 AM5/4/09

to

James,
Another aspect, especially if a large amount of storage
is to be trimmed, is how long will it take? This relates
to the timeout value we should associate with such an
invocation. The FORMAT UNIT and START STOP UNIT commands
have an IMMED bit, but not WRITE SAME.

Speaking of FORMAT UNIT, some words were added into sbc3r18
that suggest a FORMAT UNIT command could be interpreted as
unmap/trim the whole disk.

Doug Gilbert

James Bottomley

unread,

May 4, 2009, 11:30:15 AM5/4/09

to

On Mon, 2009-05-04 at 17:11 +0200, Douglas Gilbert wrote:
> Another aspect, especially if a large amount of storage
> is to be trimmed, is how long will it take? This relates
> to the timeout value we should associate with such an
> invocation. The FORMAT UNIT and START STOP UNIT commands
> have an IMMED bit, but not WRITE SAME.

I was assuming it would be more or less instantaneous ... if it's not
then the plan of trying to keep thin provisioning happy just by sending
down a maximal trim from the filesystem is going to run into trouble.

> Speaking of FORMAT UNIT, some words were added into sbc3r18
> that suggest a FORMAT UNIT command could be interpreted as
> unmap/trim the whole disk.

Seems reasonable ... if you format the device it could be argued you're
not relying on the data contents any more ...

James

Boaz Harrosh

unread,

May 4, 2009, 11:30:21 AM5/4/09

to

On 05/04/2009 01:47 AM, Jeff Garzik wrote:

> James Bottomley wrote:
>> The ideal for REQ_TYPE_DISCARD seems to be to force a page allocation
>> tied to a bio when it's issued at the top. That way everyone has enough
>> memory when it comes down the stack (both extents and WRITE SAME sector
>> will fit into a page ... although only just for WRITE SAME on 4k
>> sectors).
>
> Makes sense...
>
> Jeff
>

I second that, let the allocator of the request supply a buffer and free it
on return/done. Perhaps with a general helper that does all that.

Boaz