Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.

Dismiss

[PATCH]: New implementation of scsi_execute_async()

14 views

Skip to first unread message

Vladislav Bolkhovitin

unread,

Jul 9, 2009, 2:20:14 PM7/9/09

This patch reimplements scsi_execute_async(). In the new version it's a lot less
hackish and also has additional features. Namely:

1. Possibility to insert commands both in tail and in head of the queue.

2. Possibility to explicitly specify if the last SG element has space for padding.

This patch based on the previous patches posted by Tejun Heo. Comparing to them
it has the following improvements:

1. It uses BIOs chaining instead of kmalloc()ing the whole bio.

2. It uses SGs chaining instead of kmalloc()ing one big SG in case if direct
mapping failed (e.g. because of DMA alignment or padding).

3. If direct mapping failed, if possible, it copies only the last SG element,
not the whole SG.

Also this patch adds and exports function blk_copy_sg(), which copies one SG to
another.

At the moment SCST is the only user of this functionality. It needs it, because
its target drivers, which are, basically, SCSI drivers, can deal only with SGs,
not with BIOs. But, according the latest discussions, there are other potential
users for of this functionality, so I'm sending this patch in a hope that it will be
also useful for them and eventually will be merged in the mainline kernel.

This patch requires previously sent patch with subject "[PATCH]: Rename REQ_COPY_USER
to more descriptive REQ_HAS_TAIL_SPACE_FOR_PADDING".

It's against 2.6.30.1, but if necessary, I can update it to any necessary
kernel version.

Signed-off-by: Vladislav Bolkhovitin <v...@vlnb.net>

block/blk-map.c | 536 +++++++++++++++++++++++++++++++++++++++++++++
drivers/scsi/scsi_lib.c | 108 ++++++++-
include/linux/blkdev.h | 6
include/scsi/scsi_device.h | 11
4 files changed, 660 insertions(+), 1 deletion(-)

diff -upkr linux-2.6.30.1/block/blk-map.c linux-2.6.30.1/block/blk-map.c
--- linux-2.6.30.1/block/blk-map.c 2009-06-10 07:05:27.000000000 +0400
+++ linux-2.6.30.1/block/blk-map.c 2009-07-09 21:33:07.000000000 +0400
@@ -5,6 +5,7 @@
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
+#include <linux/scatterlist.h>
#include <scsi/sg.h> /* for struct sg_iovec */

#include "blk.h"
@@ -273,6 +274,541 @@ int blk_rq_unmap_user(struct bio *bio)
EXPORT_SYMBOL(blk_rq_unmap_user);

/**
+ * blk_copy_sg - copy one SG vector to another
+ * @dst_sg: destination SG
+ * @src_sg: source SG
+ * @copy_len: maximum amount of data to copy. If 0, then copy all.
+ * @d_km_type: kmap_atomic type for the destination SG
+ * @s_km_type: kmap_atomic type for the source SG
+ *
+ * Description:
+ * Data from the destination SG vector will be copied to the source SG
+ * vector. End of the vectors will be determined by sg_next() returning
+ * NULL. Returns number of bytes copied.
+ */
+int blk_copy_sg(struct scatterlist *dst_sg,
+ struct scatterlist *src_sg, size_t copy_len,
+ enum km_type d_km_type, enum km_type s_km_type)
+{
+ int res = 0;
+ size_t src_len, dst_len, src_offs, dst_offs;
+ struct page *src_page, *dst_page;
+
+ if (copy_len == 0)
+ copy_len = 0x7FFFFFFF; /* copy all */
+
+ dst_page = sg_page(dst_sg);
+ dst_len = dst_sg->length;
+ dst_offs = dst_sg->offset;
+
+ src_offs = 0;
+ do {
+ src_page = sg_page(src_sg);
+ src_len = src_sg->length;
+ src_offs = src_sg->offset;
+
+ do {
+ void *saddr, *daddr;
+ size_t n;
+
+ saddr = kmap_atomic(src_page, s_km_type) + src_offs;
+ daddr = kmap_atomic(dst_page, d_km_type) + dst_offs;
+
+ if ((src_offs == 0) && (dst_offs == 0) &&
+ (src_len >= PAGE_SIZE) && (dst_len >= PAGE_SIZE) &&
+ (copy_len >= PAGE_SIZE)) {
+ copy_page(daddr, saddr);
+ n = PAGE_SIZE;
+ } else {
+ n = min_t(size_t, PAGE_SIZE - dst_offs,
+ PAGE_SIZE - src_offs);
+ n = min(n, src_len);
+ n = min(n, dst_len);
+ n = min_t(size_t, n, copy_len);
+ memcpy(daddr, saddr, n);
+ dst_offs += n;
+ src_offs += n;
+ }
+
+ kunmap_atomic(saddr, s_km_type);
+ kunmap_atomic(daddr, d_km_type);
+
+ res += n;
+ copy_len -= n;
+ if (copy_len == 0)
+ goto out;
+
+ if ((src_offs & ~PAGE_MASK) == 0) {
+ src_page = nth_page(src_page, 1);
+ src_offs = 0;
+ }
+ if ((dst_offs & ~PAGE_MASK) == 0) {
+ dst_page = nth_page(dst_page, 1);
+ dst_offs = 0;
+ }
+
+ src_len -= n;
+ dst_len -= n;
+ if (dst_len == 0) {
+ dst_sg = sg_next(dst_sg);
+ if (dst_sg == NULL)
+ goto out;
+ dst_page = sg_page(dst_sg);
+ dst_len = dst_sg->length;
+ dst_offs = dst_sg->offset;
+ }
+ } while (src_len > 0);
+
+ src_sg = sg_next(src_sg);
+ } while (src_sg != NULL);
+
+out:
+ return res;
+}
+EXPORT_SYMBOL(blk_copy_sg);
+
+/**
+ * blk_rq_unmap_kern_sg - "unmaps" data buffers in the request
+ * @req: request to unmap
+ * @do_copy: sets copy data between buffers, if needed, or not
+ *
+ * Description:
+ * It frees all additional buffers allocated for SG->BIO mapping.
+ */
+void blk_rq_unmap_kern_sg(struct request *req, int do_copy)
+{
+ struct scatterlist *hdr = (struct scatterlist *)req->end_io_data;
+
+ if (hdr == NULL)
+ goto out;
+
+ if (hdr->length == 0) {
+ /* Tail element only was copied */
+ struct scatterlist *new_sg = &hdr[1];
+ struct scatterlist *orig_sg = (struct scatterlist *)hdr->page_link;
+
+ if ((rq_data_dir(req) == READ) && do_copy) {
+ void *saddr, *daddr;
+
+ saddr = kmap_atomic(sg_page(orig_sg), KM_BIO_SRC_IRQ);
+ daddr = kmap_atomic(sg_page(new_sg), KM_BIO_DST_IRQ) +
+ new_sg->offset;
+ memcpy(daddr, saddr, orig_sg->length);
+ kunmap_atomic(saddr, KM_BIO_SRC_IRQ);
+ kunmap_atomic(daddr, KM_BIO_DST_IRQ);
+ }
+
+ __free_pages(sg_page(orig_sg), get_order(orig_sg->length));
+ *orig_sg = *new_sg;
+ kfree(hdr);
+ } else {
+ /* The whole SG was copied */
+ struct scatterlist *new_sgl = &hdr[1];
+ struct scatterlist *orig_sgl = (struct scatterlist *)hdr->page_link;
+ struct scatterlist *sg, *start_sg;
+ int n;
+
+ if ((rq_data_dir(req) == READ) && do_copy) {
+ blk_copy_sg(orig_sgl, new_sgl, 0, KM_BIO_DST_IRQ,
+ KM_BIO_SRC_IRQ);
+ }
+
+ start_sg = hdr;
+ sg = new_sgl;
+ n = 1;
+ while (sg != NULL) {
+ __free_page(sg_page(sg));
+ sg = sg_next(sg);
+ n++;
+ /* One entry for chaining */
+ if ((sg == NULL) || (n == (SG_MAX_SINGLE_ALLOC - 1))) {
+ kfree(start_sg);
+ start_sg = sg;
+ n = 0;
+ }
+ }
+ }
+
+out:
+ return;
+}
+EXPORT_SYMBOL(blk_rq_unmap_kern_sg);
+
+static int blk_rq_handle_align_tail_only(struct request *rq,
+ struct scatterlist *sg_to_copy,
+ gfp_t gfp, gfp_t page_gfp)
+{
+ int res = 0;
+ struct scatterlist *tail_sg = sg_to_copy;
+ struct scatterlist *new_sg;
+ struct scatterlist *hdr;
+ int new_sg_nents;
+ struct page *pg;
+
+ new_sg_nents = 2;
+
+ new_sg = kmalloc(sizeof(*new_sg) * new_sg_nents, gfp);
+ if (new_sg == NULL)
+ goto out_nomem;
+
+ sg_init_table(new_sg, new_sg_nents);
+
+ hdr = new_sg;
+ new_sg++;
+ new_sg_nents--;
+
+ hdr->page_link = (unsigned long)tail_sg;
+ *new_sg = *tail_sg;
+
+ pg = alloc_pages(page_gfp, get_order(tail_sg->length));
+ if (pg == NULL)
+ goto err_free_new_sg;
+
+ if (rq_data_dir(rq) == WRITE) {
+ void *saddr, *daddr;
+ saddr = kmap_atomic(sg_page(tail_sg), KM_USER0) +
+ tail_sg->offset;
+ daddr = kmap_atomic(pg, KM_USER1);
+ memcpy(daddr, saddr, tail_sg->length);
+ kunmap_atomic(saddr, KM_USER0);
+ kunmap_atomic(daddr, KM_USER1);
+ }
+
+ sg_assign_page(tail_sg, pg);
+ tail_sg->offset = 0;
+
+ rq->end_io_data = hdr;
+ rq->cmd_flags |= REQ_HAS_TAIL_SPACE_FOR_PADDING;
+
+out:
+ return res;
+
+err_free_new_sg:
+ kfree(new_sg);
+
+out_nomem:
+ res = -ENOMEM;
+ goto out;
+}
+
+static int blk_rq_handle_align(struct request *rq, struct scatterlist **psgl,
+ int *pnents, struct scatterlist *sgl_to_copy,
+ int nents_to_copy, gfp_t gfp, gfp_t page_gfp)
+{
+ int res = 0, i;
+ struct scatterlist *sgl = *psgl;
+ int nents = *pnents;
+ struct scatterlist *sg, *prev_sg;
+ struct scatterlist *new_sgl;
+ struct scatterlist *hdr;
+ size_t len = 0, to_copy;
+ int new_sgl_nents, new_sgl_nents_to_alloc, n;
+
+ if (sgl != sgl_to_copy) {
+ /* Copy only the last element */
+ res = blk_rq_handle_align_tail_only(rq, sgl_to_copy,
+ gfp, page_gfp);
+ if (res == 0)
+ goto out;
+ }
+
+ for_each_sg(sgl, sg, nents, i)
+ len += sg->length;
+ to_copy = len;
+
+ /*
+ * Let's keep each SG allocation inside a single page to decrease
+ * probability of failure.
+ */
+
+ new_sgl_nents = PFN_UP(len) + 1;
+ new_sgl_nents_to_alloc = new_sgl_nents +
+ ((new_sgl_nents - 1) / SG_MAX_SINGLE_ALLOC);
+ n = min_t(size_t, SG_MAX_SINGLE_ALLOC, new_sgl_nents_to_alloc);
+
+ new_sgl = kmalloc(sizeof(*new_sgl) * n, gfp);
+ if (new_sgl == NULL)
+ goto out_nomem;
+
+ sg_init_table(new_sgl, n);
+
+ new_sgl_nents_to_alloc -= n;
+ sg = new_sgl;
+ while (new_sgl_nents_to_alloc > 0) {
+ prev_sg = sg;
+ n = min_t(size_t, SG_MAX_SINGLE_ALLOC, new_sgl_nents_to_alloc);
+
+ sg = kmalloc(sizeof(*sg) * n, gfp);
+ if (sg == NULL)
+ goto out_nomem;
+
+ sg_init_table(sg, n);
+ sg_chain(prev_sg, SG_MAX_SINGLE_ALLOC, sg);
+
+ new_sgl_nents_to_alloc -= n;
+ };
+
+ hdr = new_sgl;
+ new_sgl++;
+ new_sgl_nents--;
+
+ hdr->page_link = (unsigned long)sgl;
+ hdr->length = nents;
+
+ for_each_sg(new_sgl, sg, new_sgl_nents, i) {
+ struct page *pg;
+
+ pg = alloc_page(page_gfp);
+ if (pg == NULL)
+ goto err_free_new_sgl;
+
+ sg_assign_page(sg, pg);
+ sg->length = min_t(size_t, PAGE_SIZE, len);
+
+ len -= PAGE_SIZE;
+ }
+
+ if (rq_data_dir(rq) == WRITE) {
+ /*
+ * We need to limit amount of copied data to to_copy, because
+ * sgl might have the last element not marked as last in
+ * SG chaining.
+ */
+ blk_copy_sg(new_sgl, sgl, to_copy, KM_USER0, KM_USER1);
+ }
+
+ rq->end_io_data = hdr;
+ rq->cmd_flags |= REQ_HAS_TAIL_SPACE_FOR_PADDING;
+
+ *psgl = new_sgl;
+ *pnents = new_sgl_nents;
+
+out:
+ return res;
+
+err_free_new_sgl:
+ for_each_sg(new_sgl, sg, new_sgl_nents, i) {
+ struct page *pg = sg_page(sg);
+ if (pg == NULL)
+ break;
+ __free_page(pg);
+ }
+
+out_nomem:
+ res = -ENOMEM;
+ goto out;
+}
+
+static void bio_map_kern_endio(struct bio *bio, int err)
+{
+ bio_put(bio);
+}
+
+static int __blk_rq_map_kern_sg(struct request *rq, struct scatterlist *sgl,
+ int nents, gfp_t gfp, struct scatterlist **sgl_to_copy,
+ int *nents_to_copy)
+{
+ int res;
+ struct request_queue *q = rq->q;
+ int rw = rq_data_dir(rq);
+ int max_nr_vecs, i;
+ size_t tot_len;
+ bool need_new_bio;
+ struct scatterlist *sg, *prev_sg = NULL;
+ struct bio *bio = NULL, *hbio = NULL, *tbio = NULL;
+
+ *sgl_to_copy = NULL;
+
+ if (unlikely((sgl == 0) || (nents <= 0))) {
+ WARN_ON(1);
+ res = -EINVAL;
+ goto out;
+ }
+
+ /*
+ * Let's keep each bio allocation inside a single page to decrease
+ * probability of failure.
+ */
+ max_nr_vecs = min_t(size_t,
+ ((PAGE_SIZE - sizeof(struct bio)) / sizeof(struct bio_vec)),
+ BIO_MAX_PAGES);
+
+ need_new_bio = true;
+ tot_len = 0;
+ for_each_sg(sgl, sg, nents, i) {
+ struct page *page = sg_page(sg);
+ void *page_addr = page_address(page);
+ size_t len = sg->length, l;
+ size_t offset = sg->offset;
+
+ tot_len += len;
+ prev_sg = sg;
+
+ /*
+ * Each segment must be aligned on DMA boundary and
+ * not on stack. The last one may have unaligned
+ * length as long as the total length is aligned to
+ * DMA padding alignment.
+ */
+ if (i == nents - 1)
+ l = 0;
+ else
+ l = len;
+ if (((sg->offset | l) & queue_dma_alignment(q)) ||
+ (page_addr && object_is_on_stack(page_addr + sg->offset))) {
+ res = -EINVAL;
+ goto out_need_copy;
+ }
+
+ while (len > 0) {
+ size_t bytes;
+ int rc;
+
+ if (need_new_bio) {
+ bio = bio_kmalloc(gfp, max_nr_vecs);
+ if (bio == NULL) {
+ res = -ENOMEM;
+ goto out_free_bios;
+ }
+
+ if (rw == WRITE)
+ bio->bi_rw |= 1 << BIO_RW;
+
+ bio->bi_end_io = bio_map_kern_endio;
+
+ if (hbio == NULL)
+ hbio = tbio = bio;
+ else
+ tbio = tbio->bi_next = bio;
+ }
+
+ bytes = min_t(size_t, len, PAGE_SIZE - offset);
+
+ rc = bio_add_pc_page(q, bio, page, bytes, offset);
+ if (rc < bytes) {
+ if (unlikely(need_new_bio || (rc < 0))) {
+ if (rc < 0)
+ res = rc;
+ else
+ res = -EIO;
+ goto out_need_copy;
+ } else {
+ need_new_bio = true;
+ len -= rc;
+ offset += rc;
+ continue;
+ }
+ }
+
+ need_new_bio = false;
+ offset = 0;
+ len -= bytes;
+ page = nth_page(page, 1);
+ }
+ }
+
+ if (hbio == NULL) {
+ res = -EINVAL;
+ goto out_free_bios;
+ }
+
+ /* Total length must be aligned on DMA padding alignment */
+ if ((tot_len & q->dma_pad_mask) &&
+ !(rq->cmd_flags & REQ_HAS_TAIL_SPACE_FOR_PADDING)) {
+ res = -EINVAL;
+ if (sgl->offset == 0) {
+ *sgl_to_copy = prev_sg;
+ *nents_to_copy = 1;
+ goto out_free_bios;
+ } else
+ goto out_need_copy;
+ }
+
+ while (hbio != NULL) {
+ bio = hbio;
+ hbio = hbio->bi_next;
+ bio->bi_next = NULL;
+
+ blk_queue_bounce(q, &bio);
+
+ res = blk_rq_append_bio(q, rq, bio);
+ if (unlikely(res != 0)) {
+ bio->bi_next = hbio;
+ hbio = bio;
+ goto out_free_bios;
+ }
+ }
+
+ rq->buffer = rq->data = NULL;
+
+out:
+ return res;
+
+out_need_copy:
+ *sgl_to_copy = sgl;
+ *nents_to_copy = nents;
+
+out_free_bios:
+ while (hbio != NULL) {
+ bio = hbio;
+ hbio = hbio->bi_next;
+ bio_put(bio);
+ }
+ goto out;
+}
+
+/**
+ * blk_rq_map_kern_sg - map kernel data to a request, for REQ_TYPE_BLOCK_PC
+ * @rq: request to fill
+ * @sgl: area to map
+ * @nents: number of elements in @sgl
+ * @gfp: memory allocation flags
+ *
+ * Description:
+ * Data will be mapped directly if possible. Otherwise a bounce
+ * buffer will be used.
+ */
+int blk_rq_map_kern_sg(struct request *rq, struct scatterlist *sgl,
+ int nents, gfp_t gfp)
+{
+ int res;
+ struct scatterlist *sg_to_copy = NULL;
+ int nents_to_copy = 0;
+
+ if (unlikely((sgl == 0) || (sgl->length == 0) ||
+ (nents <= 0) || (rq->end_io_data != NULL))) {
+ WARN_ON(1);
+ res = -EINVAL;
+ goto out;
+ }
+
+ res = __blk_rq_map_kern_sg(rq, sgl, nents, gfp, &sg_to_copy,
+ &nents_to_copy);
+ if (unlikely(res != 0)) {
+ if (sg_to_copy == NULL)
+ goto out;
+
+ res = blk_rq_handle_align(rq, &sgl, &nents, sg_to_copy,
+ nents_to_copy, gfp, rq->q->bounce_gfp | gfp);
+ if (unlikely(res != 0))
+ goto out;
+
+ res = __blk_rq_map_kern_sg(rq, sgl, nents, gfp, &sg_to_copy,
+ &nents_to_copy);
+ if (res != 0) {
+ blk_rq_unmap_kern_sg(rq, 0);
+ goto out;
+ }
+ }
+
+ rq->buffer = rq->data = NULL;
+
+out:
+ return res;
+}
+EXPORT_SYMBOL(blk_rq_map_kern_sg);
+
+/**
* blk_rq_map_kern - map kernel data to a request, for REQ_TYPE_BLOCK_PC usage
* @q: request queue where request should be inserted
* @rq: request to fill
diff -upkr linux-2.6.30.1/drivers/scsi/scsi_lib.c linux-2.6.30.1/drivers/scsi/scsi_lib.c
--- linux-2.6.30.1/drivers/scsi/scsi_lib.c 2009-06-10 07:05:27.000000000 +0400
+++ linux-2.6.30.1/drivers/scsi/scsi_lib.c 2009-07-08 21:24:29.000000000 +0400
@@ -277,6 +277,100 @@ int scsi_execute_req(struct scsi_device
}
EXPORT_SYMBOL(scsi_execute_req);

+struct scsi_io_context {
+ void *blk_data;
+ void *data;
+ void (*done)(void *data, char *sense, int result, int resid);
+ char sense[SCSI_SENSE_BUFFERSIZE];
+};
+
+static struct kmem_cache *scsi_io_context_cache;
+
+static void scsi_end_async(struct request *req, int error)
+{
+ struct scsi_io_context *sioc = req->end_io_data;
+
+ req->end_io_data = sioc->blk_data;
+ blk_rq_unmap_kern_sg(req, (error == 0));
+
+ if (sioc->done)
+ sioc->done(sioc->data, sioc->sense, req->errors, req->data_len);
+
+ kmem_cache_free(scsi_io_context_cache, sioc);
+ __blk_put_request(req->q, req);
+}
+
+/**
+ * scsi_execute_async - insert request
+ * @sdev: scsi device
+ * @cmd: scsi command
+ * @cmd_len: length of scsi cdb
+ * @data_direction: DMA_TO_DEVICE, DMA_FROM_DEVICE, or DMA_NONE
+ * @sgl: data buffer scatterlist
+ * @nents: number of elements in the sgl
+ * @timeout: request timeout in seconds
+ * @retries: number of times to retry request
+ * @privdata: data passed to done()
+ * @done: callback function when done
+ * @gfp: memory allocation flags
+ * @flags: one or more SCSI_ASYNC_EXEC_FLAG_* flags
+ */
+int scsi_execute_async(struct scsi_device *sdev, const unsigned char *cmd,
+ int cmd_len, int data_direction, struct scatterlist *sgl,
+ int nents, int timeout, int retries, void *privdata,
+ void (*done)(void *, char *, int, int), gfp_t gfp,
+ int flags)
+{
+ struct request *req;
+ struct scsi_io_context *sioc;
+ int err = 0;
+ int write = (data_direction == DMA_TO_DEVICE);
+
+ sioc = kmem_cache_zalloc(scsi_io_context_cache, gfp);
+ if (sioc == NULL)
+ return DRIVER_ERROR << 24;
+
+ req = blk_get_request(sdev->request_queue, write, gfp);
+ if (req == NULL)
+ goto free_sense;
+ req->cmd_type = REQ_TYPE_BLOCK_PC;
+ req->cmd_flags |= REQ_QUIET;
+
+ if (flags & SCSI_ASYNC_EXEC_FLAG_HAS_TAIL_SPACE_FOR_PADDING)
+ req->cmd_flags |= REQ_HAS_TAIL_SPACE_FOR_PADDING;
+
+ if (sgl != NULL) {
+ err = blk_rq_map_kern_sg(req, sgl, nents, gfp);
+ if (err)
+ goto free_req;
+ }
+
+ sioc->blk_data = req->end_io_data;
+ sioc->data = privdata;
+ sioc->done = done;
+
+ req->cmd_len = cmd_len;
+ memset(req->cmd, 0, BLK_MAX_CDB); /* ATAPI hates garbage after CDB */
+ memcpy(req->cmd, cmd, req->cmd_len);
+ req->sense = sioc->sense;
+ req->sense_len = 0;
+ req->timeout = timeout;
+ req->retries = retries;
+ req->end_io_data = sioc;
+
+ blk_execute_rq_nowait(req->q, NULL, req,
+ flags & SCSI_ASYNC_EXEC_FLAG_AT_HEAD, scsi_end_async);
+ return 0;
+
+free_req:
+ blk_put_request(req);
+
+free_sense:
+ kmem_cache_free(scsi_io_context_cache, sioc);
+ return DRIVER_ERROR << 24;
+}
+EXPORT_SYMBOL_GPL(scsi_execute_async);
+
/*
* Function: scsi_init_cmd_errh()
*
@@ -1743,12 +1837,20 @@ int __init scsi_init_queue(void)
{
int i;

+ scsi_io_context_cache = kmem_cache_create("scsi_io_context",
+ sizeof(struct scsi_io_context),
+ 0, 0, NULL);
+ if (!scsi_io_context_cache) {
+ printk(KERN_ERR "SCSI: can't init scsi io context cache\n");
+ return -ENOMEM;
+ }
+
scsi_sdb_cache = kmem_cache_create("scsi_data_buffer",
sizeof(struct scsi_data_buffer),
0, 0, NULL);
if (!scsi_sdb_cache) {
printk(KERN_ERR "SCSI: can't init scsi sdb cache\n");
- return -ENOMEM;
+ goto cleanup_io_context;
}

for (i = 0; i < SG_MEMPOOL_NR; i++) {
@@ -1784,6 +1886,9 @@ cleanup_sdb:
}
kmem_cache_destroy(scsi_sdb_cache);

+cleanup_io_context:
+ kmem_cache_destroy(scsi_io_context_cache);
+
return -ENOMEM;
}

@@ -1791,6 +1896,7 @@ void scsi_exit_queue(void)
{
int i;

+ kmem_cache_destroy(scsi_io_context_cache);
kmem_cache_destroy(scsi_sdb_cache);

for (i = 0; i < SG_MEMPOOL_NR; i++) {
diff -upkr linux-2.6.30.1/include/linux/blkdev.h linux-2.6.30.1/include/linux/blkdev.h
--- linux-2.6.30.1/include/linux/blkdev.h 2009-06-10 07:05:27.000000000 +0400
+++ linux-2.6.30.1/include/linux/blkdev.h 2009-07-08 21:24:29.000000000 +0400
@@ -807,6 +807,9 @@ extern int blk_rq_map_kern(struct reques
extern int blk_rq_map_user_iov(struct request_queue *, struct request *,
struct rq_map_data *, struct sg_iovec *, int,
unsigned int, gfp_t);
+extern int blk_rq_map_kern_sg(struct request *rq,
+ struct scatterlist *sgl, int nents, gfp_t gfp);
+extern void blk_rq_unmap_kern_sg(struct request *req, int do_copy);
extern int blk_execute_rq(struct request_queue *, struct gendisk *,
struct request *, int);
extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
@@ -909,6 +912,9 @@ extern void blk_dump_rq_flags(struct req
extern void generic_unplug_device(struct request_queue *);
extern long nr_blockdev_pages(void);

+extern int blk_copy_sg(struct scatterlist *, struct scatterlist *, size_t,
+ enum km_type, enum km_type);
+
int blk_get_queue(struct request_queue *);
struct request_queue *blk_alloc_queue(gfp_t);
struct request_queue *blk_alloc_queue_node(gfp_t, int);
diff -upkr linux-2.6.30.1/include/scsi/scsi_device.h linux-2.6.30.1/include/scsi/scsi_device.h
--- linux-2.6.30.1/include/scsi/scsi_device.h 2009-06-10 07:05:27.000000000 +0400
+++ linux-2.6.30.1/include/scsi/scsi_device.h 2009-07-08 21:24:29.000000000 +0400
@@ -372,6 +372,17 @@ extern int scsi_execute_req(struct scsi_
struct scsi_sense_hdr *, int timeout, int retries,
int *resid);

+#define SCSI_ASYNC_EXEC_FLAG_AT_HEAD 1
+#define SCSI_ASYNC_EXEC_FLAG_HAS_TAIL_SPACE_FOR_PADDING 2
+
+#define SCSI_EXEC_REQ_FIFO_DEFINED
+extern int scsi_execute_async(struct scsi_device *sdev, const unsigned char *cmd,
+ int cmd_len, int data_direction,
+ struct scatterlist *sgl, int nents, int timeout,
+ int retries, void *privdata,
+ void (*done)(void *, char *, int, int),
+ gfp_t gfp, int flags);
+
static inline int __must_check scsi_device_reprobe(struct scsi_device *sdev)
{
return device_reprobe(&sdev->sdev_gendev);

--
Home page of SCST and target drivers: http://scst.sourceforge.net
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Joe Eykholt

unread,

Jul 9, 2009, 3:40:08 PM7/9/09

This should say:

* Data from the source SG vector will be copied to the destination SG

It seems unlikely to result in real misuse, though!

This may be correct, but what happens if dst_page is compound
and larger than PAGE_SIZE? Could dst_offs be larger than PAGE_SIZE
and cause daddr to be past the mapping? Can that happen?

It seems better to me to do:
daddr = kmap_atomic(dst_page + (dst_offs >> PAGE_SHIFT), d_km_type);

I'm not an expert on sg list use, though, so what you have could
be perfectly all right.

This should be tested on both i386 and x86_64 both. Of course,
this comment applies to other places in the file.

> +
> + if ((src_offs == 0) && (dst_offs == 0) &&
> + (src_len >= PAGE_SIZE) && (dst_len >= PAGE_SIZE) &&
> + (copy_len >= PAGE_SIZE)) {

The above has ten extra parens and would look nicer without them.
The same comment applies to other places in the patch.
Also, I think the approved style is to do !src_offs instead of
src_offs == 0, though I also prefer the latter.

Shouldn't these declarations be at the top?
Otherwise, there will be a cleanup patch to move it at some point.

I'm not reviewing this in detail, by the way. Just a few things that
caught my eye.

Joe Eykholt

unread,

Jul 10, 2009, 7:10:09 AM7/10/09

Just realized this isn't right either because the low bits of the offset
are lost, but you get the issue. Maybe if the page is compound
the mapping is always contiguous anyway.

Regards,
Joe

Boaz Harrosh

unread,

Jul 12, 2009, 9:10:12 AM7/12/09

The scsi bits are not needed and just add complexity.
allocate the request, call blk_rq_map_kern_sg() and
blk_execute_xxx directly from your driver. Since you exacly
know your code path lots of if(s) and flags are saved and that ugly
scsi_io_context allocation.

If you absolutely need that scsi_normalize_sense() of scsi_execute_req
call it from driver, its already exported.

This has nothing to do with block layer lib/scatterlist.c is a better
candidate.

see if you can refactor it better together with sg_copy_to/from_buffer

> +/**
> + * blk_rq_unmap_kern_sg - "unmaps" data buffers in the request
> + * @req: request to unmap
> + * @do_copy: sets copy data between buffers, if needed, or not
> + *
> + * Description:
> + * It frees all additional buffers allocated for SG->BIO mapping.
> + */
> +void blk_rq_unmap_kern_sg(struct request *req, int do_copy)
> +{
> + struct scatterlist *hdr = (struct scatterlist *)req->end_io_data;
> +

You can't use req->end_io_data here! req->end_io_data is reserved for
blk_execute_async callers. It can not be used for private block use.

(Don't you use it in this patchset from scsi_execute_async ?)

> + if (hdr == NULL)
> + goto out;
> +
> + if (hdr->length == 0) {
> + /* Tail element only was copied */
> + struct scatterlist *new_sg = &hdr[1];
> + struct scatterlist *orig_sg = (struct scatterlist *)hdr->page_link;
> +

I don't like that overloading of (scatterlist *) like that. Put a well defined struct
or union with proper flags and types, so all possibilities are clear at both sides of
the code. Document that structure for the different options.

> + if ((rq_data_dir(req) == READ) && do_copy) {
> + void *saddr, *daddr;
> +
> + saddr = kmap_atomic(sg_page(orig_sg), KM_BIO_SRC_IRQ);
> + daddr = kmap_atomic(sg_page(new_sg), KM_BIO_DST_IRQ) +
> + new_sg->offset;
> + memcpy(daddr, saddr, orig_sg->length);
> + kunmap_atomic(saddr, KM_BIO_SRC_IRQ);
> + kunmap_atomic(daddr, KM_BIO_DST_IRQ);
> + }
> +
> + __free_pages(sg_page(orig_sg), get_order(orig_sg->length));
> + *orig_sg = *new_sg;
> + kfree(hdr);
> + } else {
> + /* The whole SG was copied */
> + struct scatterlist *new_sgl = &hdr[1];
> + struct scatterlist *orig_sgl = (struct scatterlist *)hdr->page_link;
> + struct scatterlist *sg, *start_sg;

Here too ugly as hell, use a proper structure with the needed types.

below should use the sg_alloc_table / sg_free_table constructs for
the proper chaining/setting of scatterlist chains. Failing to do so
is both a code duplication and hinder of future changes to these constructs.
see all places that use sg_alloc_table/sg_free_table in source like scsi_lib.c.

I'll stop the review here. You need to solve the req->end_io_data use and change
this to sg_alloc_table/sg_free_table (perhaps the one with the callback i'm not sure)

I'll continue review on the next patchset.

Cheers
Boaz

Vladislav Bolkhovitin

unread,

Jul 14, 2009, 12:20:08 PM7/14/09

Are you against any helper functions? scsi_lib.c has only scsi_execute_async(),
which is a small helper function, which only hides internal machinery details.
All the flags it has are needed and can't be avoided, especially
SCSI_ASYNC_EXEC_FLAG_HAS_TAIL_SPACE_FOR_PADDING

The allocation of scsi_io_context is unavoidable, because sense buffer should
be allocated anyway.

> If you absolutely need that scsi_normalize_sense() of scsi_execute_req
> call it from driver, its already exported.

No, I don't need scsi_normalize_sense() and not use it anywhere.

Done. V2 of the patch is coming.

> see if you can refactor it better together with sg_copy_to/from_buffer

Unfortunately, they can't be used, because in there is a need to copy one SG
entry to another (sg_copy_elem() in the V2). Using sg_copy_elem() sg_copy()
implemented in just a few tens LOC.

>> +/**
>> + * blk_rq_unmap_kern_sg - "unmaps" data buffers in the request
>> + * @req: request to unmap
>> + * @do_copy: sets copy data between buffers, if needed, or not
>> + *
>> + * Description:
>> + * It frees all additional buffers allocated for SG->BIO mapping.
>> + */
>> +void blk_rq_unmap_kern_sg(struct request *req, int do_copy)
>> +{
>> + struct scatterlist *hdr = (struct scatterlist *)req->end_io_data;
>> +
>
> You can't use req->end_io_data here! req->end_io_data is reserved for
> blk_execute_async callers. It can not be used for private block use.

Why? I see blk_execute_rq() happily uses it. Plus, I implemented stacking of
it in scsi_execute_async().

Do you have better suggestion?

> (Don't you use it in this patchset from scsi_execute_async ?)

As I wrote, I used patches posted by Tejun Heo.

>> + if (hdr == NULL)
>> + goto out;
>> +
>> + if (hdr->length == 0) {
>> + /* Tail element only was copied */
>> + struct scatterlist *new_sg = &hdr[1];
>> + struct scatterlist *orig_sg = (struct scatterlist *)hdr->page_link;
>> +
>
> I don't like that overloading of (scatterlist *) like that. Put a well defined struct
> or union with proper flags and types, so all possibilities are clear at both sides of
> the code. Document that structure for the different options.

Done.

>> + if ((rq_data_dir(req) == READ) && do_copy) {
>> + void *saddr, *daddr;
>> +
>> + saddr = kmap_atomic(sg_page(orig_sg), KM_BIO_SRC_IRQ);
>> + daddr = kmap_atomic(sg_page(new_sg), KM_BIO_DST_IRQ) +
>> + new_sg->offset;
>> + memcpy(daddr, saddr, orig_sg->length);
>> + kunmap_atomic(saddr, KM_BIO_SRC_IRQ);
>> + kunmap_atomic(daddr, KM_BIO_DST_IRQ);
>> + }
>> +
>> + __free_pages(sg_page(orig_sg), get_order(orig_sg->length));
>> + *orig_sg = *new_sg;
>> + kfree(hdr);
>> + } else {
>> + /* The whole SG was copied */
>> + struct scatterlist *new_sgl = &hdr[1];
>> + struct scatterlist *orig_sgl = (struct scatterlist *)hdr->page_link;
>> + struct scatterlist *sg, *start_sg;
>
> Here too ugly as hell, use a proper structure with the needed types.

The same.

Done. I didn't know about sg_alloc_table()/sg_free_table().

> I'll stop the review here. You need to solve the req->end_io_data use and change
> this to sg_alloc_table/sg_free_table (perhaps the one with the callback i'm not sure)
>
> I'll continue review on the next patchset.

Thanks!
Vlad

Vladislav Bolkhovitin

unread,

Jul 14, 2009, 12:20:10 PM7/14/09

Fixed, thanks. V2 of the patch is coming.

Correct will be something like:

saddr = kmap_atomic(src_page + (src_offs >> PAGE_SHIFT), s_km_type) + (src_offs & ~PAGE_MASK);

I fixed it together with a bunch of similar issues I found reviewing the code
after your comments.

>> I'm not an expert on sg list use, though, so what you have could
>> be perfectly all right.
>>
>> This should be tested on both i386 and x86_64 both. Of course,
>> this comment applies to other places in the file.
>>
>>> +
>>> + if ((src_offs == 0) && (dst_offs == 0) &&
>>> + (src_len >= PAGE_SIZE) && (dst_len >= PAGE_SIZE) &&
>>> + (copy_len >= PAGE_SIZE)) {
>>
>> The above has ten extra parens and would look nicer without them.
>> The same comment applies to other places in the patch.

Maybe, but I like the "extra" parens (BTW, in many languages they are required, not
extra), because they give for eyes "points of stop" during reading and, so,
make overlooking various errors much harder.

>> Also, I think the approved style is to do !src_offs instead of
>> src_offs == 0, though I also prefer the latter.

I've not found anything about it.

I prefer having for variables the visibility scope as small as possible, hence
the style.

>> I'm not reviewing this in detail, by the way. Just a few things that
>> caught my eye.

Thanks very much!
Vlad

Vladislav Bolkhovitin

unread,

Jul 14, 2009, 12:20:09 PM7/14/09

This patch reimplements scsi_execute_async(). In the new version it's a lot less
hackish and also has additional features. Namely:

1. Possibility to insert commands both in tail and in head of the queue.

2. Possibility to explicitly specify if the last SG element has space for padding.

This patch based on the previous patches posted by Tejun Heo. Comparing to them
it has the following improvements:

1. It uses BIOs chaining instead of kmalloc()ing the whole bio.

2. It uses SGs chaining instead of kmalloc()ing one big SG in case if direct
mapping failed (e.g. because of DMA alignment or padding).

3. If direct mapping failed, if possible, it copies only the last SG element,
not the whole SG.

4. When needed, copy_page() is used instead of memcpy() to copy the whole pages.

Also this patch adds and exports functions sg_copy() and sg_copy_elem(), which
cop one SG to another and one SG element to another respectively.

This patch requires previously sent patch with subject "[PATCH]: Rename REQ_COPY_USER
to more descriptive REQ_HAS_TAIL_SPACE_FOR_PADDING".

It's against 2.6.30.1, but if necessary, I can update it to any necessary
kernel version.

Signed-off-by: Vladislav Bolkhovitin <v...@vlnb.net>

diff -upkr linux-2.6.30.1/block/blk-map.c linux-2.6.30.1/block/blk-map.c
--- linux-2.6.30.1/block/blk-map.c 2009-07-10 21:13:25.000000000 +0400
+++ linux-2.6.30.1/block/blk-map.c 2009-07-14 19:24:36.000000000 +0400

@@ -5,6 +5,7 @@
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
+#include <linux/scatterlist.h>
#include <scsi/sg.h> /* for struct sg_iovec */

#include "blk.h"

@@ -272,6 +273,413 @@ int blk_rq_unmap_user(struct bio *bio)
}
EXPORT_SYMBOL(blk_rq_unmap_user);

+struct blk_kern_sg_hdr {
+ struct scatterlist *orig_sgp;
+ union {
+ struct sg_table new_sg_table;
+ struct scatterlist *saved_sg;
+ };
+ bool tail_only;
+};
+
+#define BLK_KERN_SG_HDR_ENTRIES (1 + (sizeof(struct blk_kern_sg_hdr) - 1) / \
+ sizeof(struct scatterlist))

+
+/**
+ * blk_rq_unmap_kern_sg - "unmaps" data buffers in the request
+ * @req: request to unmap
+ * @do_copy: sets copy data between buffers, if needed, or not
+ *
+ * Description:
+ * It frees all additional buffers allocated for SG->BIO mapping.
+ */
+void blk_rq_unmap_kern_sg(struct request *req, int do_copy)
+{

+ struct blk_kern_sg_hdr *hdr = (struct blk_kern_sg_hdr *)req->end_io_data;

+
+ if (hdr == NULL)
+ goto out;
+

+ if (hdr->tail_only) {

+ /* Tail element only was copied */

+ struct scatterlist *saved_sg = hdr->saved_sg;
+ struct scatterlist *tail_sg = hdr->orig_sgp;

+
+ if ((rq_data_dir(req) == READ) && do_copy)

+ sg_copy_elem(saved_sg, tail_sg, tail_sg->length,
+ KM_BIO_DST_IRQ, KM_BIO_SRC_IRQ);
+
+ __free_pages(sg_page(tail_sg), get_order(tail_sg->length));
+ *tail_sg = *saved_sg;

+ kfree(hdr);
+ } else {
+ /* The whole SG was copied */

+ struct sg_table new_sg_table = hdr->new_sg_table;
+ struct scatterlist *new_sgl = new_sg_table.sgl +
+ BLK_KERN_SG_HDR_ENTRIES;
+ struct scatterlist *orig_sgl = hdr->orig_sgp;

+
+ if ((rq_data_dir(req) == READ) && do_copy)

+ sg_copy(orig_sgl, new_sgl, 0, KM_BIO_DST_IRQ,
+ KM_BIO_SRC_IRQ);
+
+ sg_free_table(&new_sg_table);

+ }
+
+out:
+ return;
+}
+EXPORT_SYMBOL(blk_rq_unmap_kern_sg);
+
+static int blk_rq_handle_align_tail_only(struct request *rq,
+ struct scatterlist *sg_to_copy,
+ gfp_t gfp, gfp_t page_gfp)
+{
+ int res = 0;
+ struct scatterlist *tail_sg = sg_to_copy;

+ struct scatterlist *saved_sg;
+ struct blk_kern_sg_hdr *hdr;
+ int saved_sg_nents;

+ struct page *pg;
+

+ saved_sg_nents = 1 + BLK_KERN_SG_HDR_ENTRIES;
+
+ saved_sg = kmalloc(sizeof(*saved_sg) * saved_sg_nents, gfp);
+ if (saved_sg == NULL)
+ goto out_nomem;
+
+ sg_init_table(saved_sg, saved_sg_nents);
+
+ hdr = (struct blk_kern_sg_hdr *)saved_sg;
+ saved_sg += BLK_KERN_SG_HDR_ENTRIES;
+ saved_sg_nents -= BLK_KERN_SG_HDR_ENTRIES;
+
+ hdr->tail_only = true;
+ hdr->orig_sgp = tail_sg;
+ hdr->saved_sg = saved_sg;
+
+ *saved_sg = *tail_sg;

+
+ pg = alloc_pages(page_gfp, get_order(tail_sg->length));
+ if (pg == NULL)

+ goto err_free_saved_sg;

+
+ sg_assign_page(tail_sg, pg);
+ tail_sg->offset = 0;
+

+ if (rq_data_dir(rq) == WRITE)

+ sg_copy_elem(tail_sg, saved_sg, saved_sg->length,
+ KM_USER1, KM_USER0);

+
+ rq->end_io_data = hdr;
+ rq->cmd_flags |= REQ_HAS_TAIL_SPACE_FOR_PADDING;
+
+out:
+ return res;
+

+err_free_saved_sg:
+ kfree(saved_sg);

+
+out_nomem:
+ res = -ENOMEM;
+ goto out;
+}
+
+static int blk_rq_handle_align(struct request *rq, struct scatterlist **psgl,
+ int *pnents, struct scatterlist *sgl_to_copy,
+ int nents_to_copy, gfp_t gfp, gfp_t page_gfp)
+{
+ int res = 0, i;
+ struct scatterlist *sgl = *psgl;
+ int nents = *pnents;

+ struct sg_table sg_table;
+ struct scatterlist *sg;
+ struct scatterlist *new_sgl;

+ size_t len = 0, to_copy;

+ int new_sgl_nents;
+ struct blk_kern_sg_hdr *hdr;

+
+ if (sgl != sgl_to_copy) {

+ /* copy only the last element */

+ res = blk_rq_handle_align_tail_only(rq, sgl_to_copy,
+ gfp, page_gfp);
+ if (res == 0)
+ goto out;

+ /* else go through */

+ }
+
+ for_each_sg(sgl, sg, nents, i)
+ len += sg->length;
+ to_copy = len;
+

+ new_sgl_nents = PFN_UP(len) + BLK_KERN_SG_HDR_ENTRIES;
+
+ res = sg_alloc_table(&sg_table, new_sgl_nents, gfp);

+ if (res != 0)

+ goto out;
+
+ new_sgl = sg_table.sgl;
+ hdr = (struct blk_kern_sg_hdr *)new_sgl;
+ new_sgl += BLK_KERN_SG_HDR_ENTRIES;
+ new_sgl_nents -= BLK_KERN_SG_HDR_ENTRIES;
+
+ hdr->tail_only = false;
+ hdr->orig_sgp = sgl;
+ hdr->new_sg_table = sg_table;

+
+ for_each_sg(new_sgl, sg, new_sgl_nents, i) {
+ struct page *pg;
+
+ pg = alloc_page(page_gfp);
+ if (pg == NULL)
+ goto err_free_new_sgl;
+
+ sg_assign_page(sg, pg);
+ sg->length = min_t(size_t, PAGE_SIZE, len);
+
+ len -= PAGE_SIZE;
+ }
+
+ if (rq_data_dir(rq) == WRITE) {
+ /*
+ * We need to limit amount of copied data to to_copy, because
+ * sgl might have the last element not marked as last in
+ * SG chaining.
+ */

+ sg_copy(new_sgl, sgl, to_copy, KM_USER0, KM_USER1);

+ }
+
+ rq->end_io_data = hdr;
+ rq->cmd_flags |= REQ_HAS_TAIL_SPACE_FOR_PADDING;
+
+ *psgl = new_sgl;
+ *pnents = new_sgl_nents;
+
+out:
+ return res;
+
+err_free_new_sgl:
+ for_each_sg(new_sgl, sg, new_sgl_nents, i) {
+ struct page *pg = sg_page(sg);
+ if (pg == NULL)
+ break;
+ __free_page(pg);
+ }

+ sg_free_table(&sg_table);
+

/**
* blk_rq_map_kern - map kernel data to a request, for REQ_TYPE_BLOCK_PC usage
* @q: request queue where request should be inserted

diff -upkr linux-2.6.30.1/drivers/scsi/scsi_lib.c linux-2.6.30.1/drivers/scsi/scsi_lib.c
--- linux-2.6.30.1/drivers/scsi/scsi_lib.c 2009-07-10 21:13:25.000000000 +0400

--- linux-2.6.30.1/include/linux/blkdev.h 2009-07-10 21:13:25.000000000 +0400
+++ linux-2.6.30.1/include/linux/blkdev.h 2009-07-13 13:56:45.000000000 +0400

@@ -807,6 +807,9 @@ extern int blk_rq_map_kern(struct reques
extern int blk_rq_map_user_iov(struct request_queue *, struct request *,
struct rq_map_data *, struct sg_iovec *, int,
unsigned int, gfp_t);
+extern int blk_rq_map_kern_sg(struct request *rq,
+ struct scatterlist *sgl, int nents, gfp_t gfp);
+extern void blk_rq_unmap_kern_sg(struct request *req, int do_copy);
extern int blk_execute_rq(struct request_queue *, struct gendisk *,
struct request *, int);
extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,

diff -upkr linux-2.6.30.1/include/linux/scatterlist.h linux-2.6.30.1/include/linux/scatterlist.h
--- linux-2.6.30.1/include/linux/scatterlist.h 2009-06-10 07:05:27.000000000 +0400
+++ linux-2.6.30.1/include/linux/scatterlist.h 2009-07-13 13:56:24.000000000 +0400
@@ -218,6 +218,13 @@ size_t sg_copy_from_buffer(struct scatte
size_t sg_copy_to_buffer(struct scatterlist *sgl, unsigned int nents,
void *buf, size_t buflen);

+int sg_copy_elem(struct scatterlist *dst_sg, struct scatterlist *src_sg,
+ size_t copy_len, enum km_type d_km_type,
+ enum km_type s_km_type);
+int sg_copy(struct scatterlist *dst_sg,

+ struct scatterlist *src_sg, size_t copy_len,
+ enum km_type d_km_type, enum km_type s_km_type);
+

/*
* Maximum number of entries that will be allocated in one piece, if
* a list larger than this is required then chaining will be utilized.
diff -upkr linux-2.6.30.1/include/scsi/scsi_device.h linux-2.6.30.1/include/scsi/scsi_device.h
--- linux-2.6.30.1/include/scsi/scsi_device.h 2009-07-10 21:13:25.000000000 +0400

+++ linux-2.6.30.1/include/scsi/scsi_device.h 2009-07-08 21:24:29.000000000 +0400
@@ -372,6 +372,17 @@ extern int scsi_execute_req(struct scsi_
struct scsi_sense_hdr *, int timeout, int retries,
int *resid);

+#define SCSI_ASYNC_EXEC_FLAG_AT_HEAD 1
+#define SCSI_ASYNC_EXEC_FLAG_HAS_TAIL_SPACE_FOR_PADDING 2
+
+#define SCSI_EXEC_REQ_FIFO_DEFINED
+extern int scsi_execute_async(struct scsi_device *sdev, const unsigned char *cmd,
+ int cmd_len, int data_direction,
+ struct scatterlist *sgl, int nents, int timeout,
+ int retries, void *privdata,
+ void (*done)(void *, char *, int, int),
+ gfp_t gfp, int flags);
+
static inline int __must_check scsi_device_reprobe(struct scsi_device *sdev)
{
return device_reprobe(&sdev->sdev_gendev);

diff -upkr linux-2.6.30.1/lib/scatterlist.c linux-2.6.30.1/lib/scatterlist.c
--- linux-2.6.30.1/lib/scatterlist.c 2009-06-10 07:05:27.000000000 +0400
+++ linux-2.6.30.1/lib/scatterlist.c 2009-07-14 19:22:49.000000000 +0400
@@ -485,3 +485,150 @@ size_t sg_copy_to_buffer(struct scatterl
return sg_copy_buffer(sgl, nents, buf, buflen, 1);
}
EXPORT_SYMBOL(sg_copy_to_buffer);
+
+/*
+ * Can switch to the next dst_sg element, so, to copy to strictly only
+ * one dst_sg element, it must be either last in the chain, or
+ * copy_len == dst_sg->length.
+ */
+static int __sg_copy_elem(struct scatterlist **pdst_sg, size_t *pdst_len,
+ size_t *pdst_offs, struct scatterlist *src_sg,
+ size_t copy_len,

+ enum km_type d_km_type, enum km_type s_km_type)
+{
+ int res = 0;

+ struct scatterlist *dst_sg;

+ size_t src_len, dst_len, src_offs, dst_offs;
+ struct page *src_page, *dst_page;
+
+ if (copy_len == 0)
+ copy_len = 0x7FFFFFFF; /* copy all */
+

+ dst_sg = *pdst_sg;
+ dst_len = *pdst_len;
+ dst_offs = *pdst_offs;

+ dst_page = sg_page(dst_sg);
+

+ src_page = sg_page(src_sg);
+ src_len = src_sg->length;
+ src_offs = src_sg->offset;
+
+ do {
+ void *saddr, *daddr;
+ size_t n;
+

+ saddr = kmap_atomic(src_page +

+ (src_offs >> PAGE_SHIFT), s_km_type) +

+ (src_offs & ~PAGE_MASK);
+ daddr = kmap_atomic(dst_page +
+ (dst_offs >> PAGE_SHIFT), d_km_type) +
+ (dst_offs & ~PAGE_MASK);
+
+ if (((src_offs & ~PAGE_MASK) == 0) &&
+ ((dst_offs & ~PAGE_MASK) == 0) &&

+ (src_len >= PAGE_SIZE) && (dst_len >= PAGE_SIZE) &&
+ (copy_len >= PAGE_SIZE)) {
+ copy_page(daddr, saddr);
+ n = PAGE_SIZE;
+ } else {

+ n = min_t(size_t, PAGE_SIZE - (dst_offs & ~PAGE_MASK),
+ PAGE_SIZE - (src_offs & ~PAGE_MASK));

+ n = min(n, src_len);
+ n = min(n, dst_len);
+ n = min_t(size_t, n, copy_len);
+ memcpy(daddr, saddr, n);
+ }
+ dst_offs += n;
+ src_offs += n;
+
+ kunmap_atomic(saddr, s_km_type);
+ kunmap_atomic(daddr, d_km_type);
+
+ res += n;
+ copy_len -= n;
+ if (copy_len == 0)
+ goto out;
+

+ src_len -= n;
+ dst_len -= n;
+ if (dst_len == 0) {
+ dst_sg = sg_next(dst_sg);
+ if (dst_sg == NULL)
+ goto out;
+ dst_page = sg_page(dst_sg);
+ dst_len = dst_sg->length;
+ dst_offs = dst_sg->offset;
+ }
+ } while (src_len > 0);
+

+out:
+ *pdst_sg = dst_sg;
+ *pdst_len = dst_len;
+ *pdst_offs = dst_offs;
+ return res;
+}
+
+/**
+ * sg_copy_elem - copy one SG element to another
+ * @dst_sg: destination SG element
+ * @src_sg: source SG element

+ * @copy_len: maximum amount of data to copy. If 0, then copy all.
+ * @d_km_type: kmap_atomic type for the destination SG
+ * @s_km_type: kmap_atomic type for the source SG
+ *
+ * Description:

+ * Data from the source SG element will be copied to the destination SG
+ * element. Returns number of bytes copied. Can switch to the next dst_sg
+ * element, so, to copy to strictly only one dst_sg element, it must be
+ * either last in the chain, or copy_len == dst_sg->length.
+ */
+int sg_copy_elem(struct scatterlist *dst_sg, struct scatterlist *src_sg,
+ size_t copy_len, enum km_type d_km_type,
+ enum km_type s_km_type)
+{
+ size_t dst_len = dst_sg->length, dst_offs = dst_sg->offset;
+
+ return __sg_copy_elem(&dst_sg, &dst_len, &dst_offs, src_sg,
+ copy_len, d_km_type, s_km_type);
+}
+EXPORT_SYMBOL(sg_copy_elem);
+
+/**
+ * sg_copy - copy one SG vector to another

+ * @dst_sg: destination SG
+ * @src_sg: source SG
+ * @copy_len: maximum amount of data to copy. If 0, then copy all.
+ * @d_km_type: kmap_atomic type for the destination SG
+ * @s_km_type: kmap_atomic type for the source SG
+ *
+ * Description:

+ * Data from the source SG vector will be copied to the destination SG

+ * vector. End of the vectors will be determined by sg_next() returning
+ * NULL. Returns number of bytes copied.
+ */

+int sg_copy(struct scatterlist *dst_sg,

+ struct scatterlist *src_sg, size_t copy_len,
+ enum km_type d_km_type, enum km_type s_km_type)
+{
+ int res = 0;

+ size_t dst_len, dst_offs;

+
+ if (copy_len == 0)
+ copy_len = 0x7FFFFFFF; /* copy all */
+

+ dst_len = dst_sg->length;
+ dst_offs = dst_sg->offset;
+

+ do {
+ copy_len -= __sg_copy_elem(&dst_sg, &dst_len, &dst_offs,
+ src_sg, copy_len, d_km_type, s_km_type);
+ if ((copy_len == 0) || (dst_sg == NULL))
+ goto out;
+

+ src_sg = sg_next(src_sg);
+ } while (src_sg != NULL);
+
+out:
+ return res;
+}

+EXPORT_SYMBOL(sg_copy);

Vladislav Bolkhovitin

unread,

Jul 14, 2009, 2:30:13 PM7/14/09

Oops, sorry, forgot to write the change log from v1. V2 has fixes
according to comments made by Joe Eykholt and Boaz Harrosh.

Vlad

Boaz Harrosh

unread,

Jul 15, 2009, 4:30:15 AM7/15/09

Yes I'm sure. This will not be accepted. Just do it directly in the driver
like all other ULD's and block layer users. Use direct blk API. All these little
things and flags you are talking about are block specific, there is no single thing
SCSI about them.

> The allocation of scsi_io_context is unavoidable, because sense buffer should
> be allocated anyway.
>

NO!, driver needs to do it's job and provide a sense buffer and call blk API
just like all the other drivers. this has nothing to do with SCSI.

>> If you absolutely need that scsi_normalize_sense() of scsi_execute_req
>> call it from driver, its already exported.
>
> No, I don't need scsi_normalize_sense() and not use it anywhere.
>

Sorry yes I was spaced out for a sec.

As I said users of blk_execute_rq_nowait() blk_execute_rq is a user of
blk_execute_rq_nowait just as the other guy.

"implemented stacking" is exactly the disaster I'm talking about.
Also it totaly breaks the blk API. Now I need to to specific code
when mapping with this API as opposed to other mappings when I execute

> Do you have better suggestion?
>

I have not look at it deeply but you'll need another system. Perhaps
like map_user/unmap_user where you give unmap_user the original bio.
Each user of map_user needs to keep a pointer to the original bio
on mapping. Maybe some other options as well. You can use the bio's
private data pointer, when building the first bio from scatter-list.

I will go head and review your new patch
Boaz

Boaz Harrosh

unread,

Jul 15, 2009, 5:10:13 AM7/15/09

There is a struct scatterlist * inside struct sg_table
just use that. No need for a union. (It's not like your saving
space). In the tail case nents == 1.
(orig_nents==0 and loose the tail_only)

> + bool tail_only;
> +};
> +
> +#define BLK_KERN_SG_HDR_ENTRIES (1 + (sizeof(struct blk_kern_sg_hdr) - 1) / \
> + sizeof(struct scatterlist))
> +
> +/**
> + * blk_rq_unmap_kern_sg - "unmaps" data buffers in the request
> + * @req: request to unmap
> + * @do_copy: sets copy data between buffers, if needed, or not
> + *
> + * Description:
> + * It frees all additional buffers allocated for SG->BIO mapping.
> + */
> +void blk_rq_unmap_kern_sg(struct request *req, int do_copy)

+void blk_rq_unmap_kern_sg(struct request *req, struct blk_kern_sg_hdr *hdr, int do_copy)

> +{
> + struct blk_kern_sg_hdr *hdr = (struct blk_kern_sg_hdr *)req->end_io_data;
> +

Again still req->end_io_data can not be used. You can add a pointer to the call
or use/add another member.

Perhaps return it to the user or pass an **void output
parameter to be set, or again another member

I do not understand. If you are bouncing on the bio level.
why do you need to do all the alignment checks and
sg-bouncing + tail handling all this is done again on the bio
level.

It looks to me that perhaps you did not understand Tejun's code
His code, as I remember, was on the bio level, but here you
are duplicating what is done in bio level.

But maybe I don't understand. Tejun?

There is nothing scsi here. Do it in the caller

Block API

> + if (flags & SCSI_ASYNC_EXEC_FLAG_HAS_TAIL_SPACE_FOR_PADDING)
> + req->cmd_flags |= REQ_HAS_TAIL_SPACE_FOR_PADDING;
> +
> + if (sgl != NULL) {
> + err = blk_rq_map_kern_sg(req, sgl, nents, gfp);
> + if (err)
> + goto free_req;
> + }
> +

All block API nothing scsi here

> + sioc->blk_data = req->end_io_data;
> + sioc->data = privdata;
> + sioc->done = done;
> +
> + req->cmd_len = cmd_len;
> + memset(req->cmd, 0, BLK_MAX_CDB); /* ATAPI hates garbage after CDB */
> + memcpy(req->cmd, cmd, req->cmd_len);

Does not support large commands.
Also a blk API nothing scsi about it

> + req->sense = sioc->sense;
> + req->sense_len = 0;
> + req->timeout = timeout;
> + req->retries = retries;
> + req->end_io_data = sioc;
> +
> + blk_execute_rq_nowait(req->q, NULL, req,
> + flags & SCSI_ASYNC_EXEC_FLAG_AT_HEAD, scsi_end_async);
> + return 0;
> +
> +free_req:
> + blk_put_request(req);
> +
> +free_sense:
> + kmem_cache_free(scsi_io_context_cache, sioc);
> + return DRIVER_ERROR << 24;
> +}
> +EXPORT_SYMBOL_GPL(scsi_execute_async);
> +

The complete scsi bits are not needed. They have nothing to do with
scsi. If at all this should go into black layer.
scsi_lib is not a wrapper around blk API.

Just use blk API directly inside the caller

Is not sg_copy_elem a copy of an sg with one element. Why do we need
two exports.

I would pass a nents count to below and discard this one.

> +
> +/**
> + * sg_copy - copy one SG vector to another
> + * @dst_sg: destination SG
> + * @src_sg: source SG
> + * @copy_len: maximum amount of data to copy. If 0, then copy all.
> + * @d_km_type: kmap_atomic type for the destination SG
> + * @s_km_type: kmap_atomic type for the source SG
> + *
> + * Description:
> + * Data from the source SG vector will be copied to the destination SG
> + * vector. End of the vectors will be determined by sg_next() returning
> + * NULL. Returns number of bytes copied.
> + */
> +int sg_copy(struct scatterlist *dst_sg,
> + struct scatterlist *src_sg, size_t copy_len,

Total length is nice, but also a nents count

> + enum km_type d_km_type, enum km_type s_km_type)
> +{
> + int res = 0;
> + size_t dst_len, dst_offs;
> +
> + if (copy_len == 0)
> + copy_len = 0x7FFFFFFF; /* copy all */
> +
> + dst_len = dst_sg->length;
> + dst_offs = dst_sg->offset;
> +
> + do {
> + copy_len -= __sg_copy_elem(&dst_sg, &dst_len, &dst_offs,
> + src_sg, copy_len, d_km_type, s_km_type);
> + if ((copy_len == 0) || (dst_sg == NULL))
> + goto out;
> +
> + src_sg = sg_next(src_sg);
> + } while (src_sg != NULL);
> +
> +out:
> + return res;
> +}
> +EXPORT_SYMBOL(sg_copy);
>

Boaz

Joe Eykholt

unread,

Jul 15, 2009, 2:00:16 PM7/15/09

The return value res is always 0 here, contrary to the description.
Maybe it should be void.

Tejun Heo

unread,

Jul 16, 2009, 4:00:18 AM7/16/09

Hello, Vladislav.

Sorry about the delay.

Vladislav Bolkhovitin wrote:
> This patch reimplements scsi_execute_async(). In the new version
> it's a lot less hackish and also has additional features. Namely:

:-)

The original patchset was focused more on unifying user and kernel and
user sg mapping handling. It seems you implemented the kernel part
completely separately. Wouldn't it be better to unify where possible?
Or is there some fundamental reason that can't be done that I missed?
Also, code organization-wise, I think good part of the posted code
belongs to bio.c.

The tail-only copying is nice but I'm not entirely sure whether such
full-blown approach is necessary. The tail padding was added
primarily for dumb ATAPI devices which want to transfer more bytes
than requested. Having extra space at the end makes the driver's job
much easier as it can just overflow into the area. Some controller do
have "drain after this" flag in the dma table but some simply can't
handle such situations properly without explicit overflow area.

So, being the horrid hack it is and highly unlikely to be used in
performance sensitive path, I think it would be better to keep the
implementation simple and dull. It just isn't something worth
investing complexity over. Of course, if you have a use case which
requires high performance tail padding, it's a different story.

Thanks.

--
tejun

Tejun Heo

unread,

Jul 16, 2009, 4:00:18 AM7/16/09

Hello, Boaz.

Boaz Harrosh wrote:
> It looks to me that perhaps you did not understand Tejun's code
> His code, as I remember, was on the bio level, but here you
> are duplicating what is done in bio level.

I think he looked at my mess :-) and decided it would be cleaner to
implement kernel sg mapping completely separately from user mapping.
Maybe it is, I'm not sure. I sure hope not tho.

Thanks.

--
tejun

Vladislav Bolkhovitin

unread,

Jul 16, 2009, 2:20:06 PM7/16/09

Hello, Tejun

Tejun Heo, on 07/16/2009 11:52 AM wrote:
> Hello, Vladislav.
>
> Sorry about the delay.
>
> Vladislav Bolkhovitin wrote:
>> This patch reimplements scsi_execute_async(). In the new version
>> it's a lot less hackish and also has additional features. Namely:
>
> :-)

The "hackish" was not about your work, it was about the original
implementation, which was too often blamed as "hackish" ;-)

I don't see a place where they can be unified from using the existing
user mapping functionality POV. And in your patchset it was the same.
Unifying in the opposite direction is possible, i.e. implementing
blk_rq_map_kern() via blk_rq_map_kern_sg(). But this will save a very
minor code piece (few tens LOCs). Would you like me to do it?

> Or is there some fundamental reason that can't be done that I missed?

See above.

> Also, code organization-wise, I think good part of the posted code
> belongs to bio.c.

Could you be more specific, please?

> The tail-only copying is nice but I'm not entirely sure whether such
> full-blown approach is necessary. The tail padding was added
> primarily for dumb ATAPI devices which want to transfer more bytes
> than requested. Having extra space at the end makes the driver's job
> much easier as it can just overflow into the area. Some controller do
> have "drain after this" flag in the dma table but some simply can't
> handle such situations properly without explicit overflow area.
>
> So, being the horrid hack it is and highly unlikely to be used in
> performance sensitive path, I think it would be better to keep the
> implementation simple and dull. It just isn't something worth
> investing complexity over. Of course, if you have a use case which
> requires high performance tail padding, it's a different story.

SCST allocates buffers in pages, so there is always a tail space, except
one relatively rare case of scst_local target driver, which allows usage
of the SCST-exported devices locally on the same host. For it the tail
spacing information is lost in the entrails of the block/SCSI
subsystems. So, why not to have the tail-only copying? Code-wise it's
quite simple, just few tens lines of code.

Thanks,
Vlad

Vladislav Bolkhovitin

unread,

Jul 16, 2009, 2:20:09 PM7/16/09

This will uncover internal details of SG-chaining implementation, which
you wanted to hide. Or didn't?

>> + while (hbio != NULL) {
>> + bio = hbio;
>> + hbio = hbio->bi_next;
>> + bio->bi_next = NULL;
>> +
>> + blk_queue_bounce(q, &bio);
>
> I do not understand. If you are bouncing on the bio level.
> why do you need to do all the alignment checks and
> sg-bouncing + tail handling all this is done again on the bio
> level.

blk_queue_bounce() does another and completely independent level of
bouncing for pages allocated in the not accessible by the device area.

> It looks to me that perhaps you did not understand Tejun's code
> His code, as I remember, was on the bio level, but here you
> are duplicating what is done in bio level.
>
> But maybe I don't understand. Tejun?
>

>> + req->cmd_len = cmd_len;
>> + memset(req->cmd, 0, BLK_MAX_CDB); /* ATAPI hates garbage after CDB */
>> + memcpy(req->cmd, cmd, req->cmd_len);
>
> Does not support large commands.

Will be fixed.

(BTW, the SCSI layer assumes large CDBs as single big buffers, but all
SCSI transports I checked transfer large CDBs in 2 different parts: the
first 16 bytes and the rest. I.e. they deal with 2 different buffers.
So, the target side (SCST) deals with 2 buffers for large CDBs as well.
Having only one req->cmd will force SCST to merge those 2 buffers into a
single buffer. So, scs[i,t]_execute_async() will have to make an
allocation for this as well.)

Perhaps, yes.

> I would pass a nents count to below and discard this one.

Perhaps, yes, but only for possible future use. I don't see any use of
it at the moment.

Thanks,
Vlad

Vladislav Bolkhovitin

unread,

Jul 16, 2009, 2:20:09 PM7/16/09

Boaz Harrosh, on 07/15/2009 12:19 PM wrote:
>>>> block/blk-map.c | 536 +++++++++++++++++++++++++++++++++++++++++++++
>>>> drivers/scsi/scsi_lib.c | 108 ++++++++-
>>>> include/linux/blkdev.h | 6
>>>> include/scsi/scsi_device.h | 11
>>>> 4 files changed, 660 insertions(+), 1 deletion(-)
>>>>
>>> The scsi bits are not needed and just add complexity.
>>> allocate the request, call blk_rq_map_kern_sg() and
>>> blk_execute_xxx directly from your driver. Since you exacly
>>> know your code path lots of if(s) and flags are saved and that ugly
>>> scsi_io_context allocation.
>> Are you against any helper functions? scsi_lib.c has only scsi_execute_async(),
>> which is a small helper function, which only hides internal machinery details.
>> All the flags it has are needed and can't be avoided, especially
>> SCSI_ASYNC_EXEC_FLAG_HAS_TAIL_SPACE_FOR_PADDING
>
> Yes I'm sure. This will not be accepted. Just do it directly in the driver
> like all other ULD's and block layer users. Use direct blk API. All these little
> things and flags you are talking about are block specific, there is no single thing
> SCSI about them.

Hmm, I don't understand. What's wrong in this small helper function? Are
you against any helper functions, so, all users of, e.g., for_each_sg()
must instead open code the corresponding functionality?

Do you want me to move scsi_execute_async() from scsi_lib.c to
scst_lib.c and rename it to scst_execute_async()? This is the maximum I
can do, because I need this functionality. But it will make all other
possible users to duplicate it.

Does flag SCSI_ASYNC_EXEC_FLAG_AT_HEAD have nothing to do to SCSI, really?

>> The allocation of scsi_io_context is unavoidable, because sense buffer should
>> be allocated anyway.
>>
>
> NO!, driver needs to do it's job and provide a sense buffer and call blk API
> just like all the other drivers. this has nothing to do with SCSI.

What job should a driver do? As I wrote, for SCST allocation of the
sense buffer is necessary in any case (in SCST sense is allocated on
demand only), so it doesn't matter to allocate few bytes more, if it
allows to hide unneeded low level details. The same, I guess, should be
true for all other possible users of this functionality.

I can't see how *well documented* stacking of end_io_data can be/lead to
any problem. All the possible alternatives I can see are worse:

1. Add in struct request one more field, like "void *blk_end_io_data"
and use it.

2. Duplicate code of bio's allocation and chaining
(__blk_rq_map_kern_sg()) for the copy case with additional code for
allocation and copying of the copy buffers on per-bio basis and use
bio->bi_private to track the copy info. Tejun Heo used this approach,
but he had only one bio without any chaining. With the chaining this
approach becomes terribly overcomplicated and ugly with *NO* real gain.

Do you like any of them? If not, I'd like to see _practical_ suggestions.

Thanks.
Vlad

Vladislav Bolkhovitin

unread,

Jul 16, 2009, 2:20:11 PM7/16/09

Joe Eykholt, on 07/15/2009 09:48 PM wrote:
>>> +int sg_copy(struct scatterlist *dst_sg,
>>> + struct scatterlist *src_sg, size_t copy_len,
>>> + enum km_type d_km_type, enum km_type s_km_type)

>>> +{
>>> + int res = 0;
>>> + size_t dst_len, dst_offs;
>>> +
>>> + if (copy_len == 0)
>>> + copy_len = 0x7FFFFFFF; /* copy all */
>>> +
>>> + dst_len = dst_sg->length;
>>> + dst_offs = dst_sg->offset;
>>> +
>>> + do {
>>> + copy_len -= __sg_copy_elem(&dst_sg, &dst_len, &dst_offs,
>>> + src_sg, copy_len, d_km_type, s_km_type);
>>> + if ((copy_len == 0) || (dst_sg == NULL))
>>> + goto out;
>>> +
>>> + src_sg = sg_next(src_sg);
>>> + } while (src_sg != NULL);
>>> +
>>> +out:
>>> + return res;
>>> +}
>
> The return value res is always 0 here, contrary to the description.
> Maybe it should be void.

Will be fixed, thanks

>>> +EXPORT_SYMBOL(sg_copy);
>>>
>> Boaz
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in

Boaz Harrosh

unread,

Jul 19, 2009, 7:40:06 AM7/19/09

On 07/16/2009 09:15 PM, Vladislav Bolkhovitin wrote:
> Boaz Harrosh, on 07/15/2009 01:07 PM wrote:
>> On 07/14/2009 07:17 PM, Vladislav Bolkhovitin wrote:

<snip>

>>> +struct blk_kern_sg_hdr {
>>> + struct scatterlist *orig_sgp;
>>> + union {
>>> + struct sg_table new_sg_table;
>>> + struct scatterlist *saved_sg;
>>> + };
>> There is a struct scatterlist * inside struct sg_table
>> just use that. No need for a union. (It's not like your saving
>> space). In the tail case nents == 1.
>> (orig_nents==0 and loose the tail_only)
>
> This will uncover internal details of SG-chaining implementation, which
> you wanted to hide. Or didn't?
>

Are you commenting on the remove of tail_only, or the reuse of
->new_sg_table.sgl instead of ->saved_sg. The later makes tons of sense
to me.

if you want to keep the "tail_only" and not hack with nents && orig_nents
that's fine.

>>> + while (hbio != NULL) {
>>> + bio = hbio;
>>> + hbio = hbio->bi_next;
>>> + bio->bi_next = NULL;
>>> +
>>> + blk_queue_bounce(q, &bio);
>> I do not understand. If you are bouncing on the bio level.
>> why do you need to do all the alignment checks and
>> sg-bouncing + tail handling all this is done again on the bio
>> level.
>
> blk_queue_bounce() does another and completely independent level of
> bouncing for pages allocated in the not accessible by the device area.
>

No! you miss-understood. blk_queue_bounce() does all the bouncing,
high-mem, alignment, padding, draining, all of them.

The code you took from Tejun was meant to go at the bio level. Here
you are already taken care of.

All you need to do is:
loop on all pages
add_pc_page();
when bio is full
chain a new bio
and so on.

Then at the end call blk_make_request() which will do the bouncing and
the merging. No need for all the copy/tail_only etc.. this is done by
lower levels for you already.

If you do that then Tejun is right you should do an:
bio_map_sg()
and not:
blk_map_sg()

>> It looks to me that perhaps you did not understand Tejun's code
>> His code, as I remember, was on the bio level, but here you
>> are duplicating what is done in bio level.
>>
>> But maybe I don't understand. Tejun?
>>
>>> + req->cmd_len = cmd_len;
>>> + memset(req->cmd, 0, BLK_MAX_CDB); /* ATAPI hates garbage after CDB */
>>> + memcpy(req->cmd, cmd, req->cmd_len);
>> Does not support large commands.
>
> Will be fixed.
>
> (BTW, the SCSI layer assumes large CDBs as single big buffers, but all
> SCSI transports I checked transfer large CDBs in 2 different parts: the
> first 16 bytes and the rest. I.e. they deal with 2 different buffers.
> So, the target side (SCST) deals with 2 buffers for large CDBs as well.
> Having only one req->cmd will force SCST to merge those 2 buffers into a
> single buffer.
> So, scs[i,t]_execute_async() will have to make an
> allocation for this as well.)
>

Yep, allocation of command buffer if command_size > 16

Why? for example the use of not having another export.

Boaz Harrosh

unread,

Jul 19, 2009, 7:40:12 AM7/19/09

Yes that's what I want, and if you can not save some of it's complexity
by, for example, merging of scsi_io_context with the request structure you
already have at scst level, or save the double-chained-callbacks at the end,
then you've done something wrong.

> Does flag SCSI_ASYNC_EXEC_FLAG_AT_HEAD have nothing to do to SCSI, really?
>

Exactly, it's just a parameter at the call to blk_execute_nowait() with it's
own defines already. Nothing SCSI.

>>> The allocation of scsi_io_context is unavoidable, because sense buffer should
>>> be allocated anyway.
>>>
>> NO!, driver needs to do it's job and provide a sense buffer and call blk API
>> just like all the other drivers. this has nothing to do with SCSI.
>
> What job should a driver do? As I wrote, for SCST allocation of the
> sense buffer is necessary in any case (in SCST sense is allocated on
> demand only), so it doesn't matter to allocate few bytes more, if it
> allows to hide unneeded low level details. The same, I guess, should be
> true for all other possible users of this functionality.
>

Easiest, as all users do is: If you already have your scst request or
command, or what ever you call it. So add a sense_buffer array embedded
there, end of story no allocation needed.

How can you be a scsi-target and return the proper errors upstream,
without digging into sense-buffer and error returns to produce the
correct report.

Again all this is not needed and should be dropped it is already done
by bio bouncing. All you need to do is add the pages to bios, chain when
full, and call blk_make_request.

> Thanks.
> Vlad

Boaz

Vladislav Bolkhovitin

unread,

Jul 20, 2009, 2:00:17 PM7/20/09

OK, I'll do. But people who will miss it, will blame you by pretty bad
words!

> Yes that's what I want, and if you can not save some of it's complexity
> by, for example, merging of scsi_io_context with the request structure you
> already have at scst level, or save the double-chained-callbacks at the end,
> then you've done something wrong.

Are you sure??

As I already wrote, in SCST sense allocated on-demand only. It is done
for performance reasons, because the sense is needed _very_ rarely and
the buffer for it is quite big, so it's better to save those allocating,
zeroing and CPU cache trashing. The fact that Linux SCSI ML requires
sense buffer preallocation is for me something rather to be fixed.
Pass-through SCST backend is used relatively rarely as well, so it's
better to keep the sense preallocation for it in the separate code path.
Then, if there is the sense preallocation, it doesn't matter to alloc
with it few bytes more.

Is it getting clearer now?

>> Does flag SCSI_ASYNC_EXEC_FLAG_AT_HEAD have nothing to do to SCSI, really?
>>
>
> Exactly, it's just a parameter at the call to blk_execute_nowait() with it's
> own defines already. Nothing SCSI.

Well, going this way we can go too far.. Like, to assertion that SCSI
HEAD OF QUEUE task attribute is a property of the Linux block layer as well.

>>>> The allocation of scsi_io_context is unavoidable, because sense buffer should
>>>> be allocated anyway.
>>>>
>>> NO!, driver needs to do it's job and provide a sense buffer and call blk API
>>> just like all the other drivers. this has nothing to do with SCSI.
>> What job should a driver do? As I wrote, for SCST allocation of the
>> sense buffer is necessary in any case (in SCST sense is allocated on
>> demand only), so it doesn't matter to allocate few bytes more, if it
>> allows to hide unneeded low level details. The same, I guess, should be
>> true for all other possible users of this functionality.
>>
>
> Easiest, as all users do is: If you already have your scst request or
> command, or what ever you call it. So add a sense_buffer array embedded
> there, end of story no allocation needed.
>
> How can you be a scsi-target and return the proper errors upstream,
> without digging into sense-buffer and error returns to produce the
> correct report.

Most SCST backends don't deal with senses at all and generate them only
at the latest stage of the commands processing to deliver errors (e.g.,
file read errors) to initiators.

Can you show me a place where the bio bouncing, i.e.
blk_queue_bounce(), does bouncing of misaligned buffers?

I don't see such places. Instead, I see that all users of the block API,
who cares about the alignment (sg, st, sr, etc.), directly or indirectly
take care about it, by, e.g., switching to copying functions, before
calling blk_queue_bounce(). See blk_rq_map_user_iov(), for instance.
This is exactly what I implemented: handling of misaligned buffers on
the layer upper blk_queue_bounce().

Thanks,
Vlad

Vladislav Bolkhovitin

unread,

Jul 20, 2009, 2:00:20 PM7/20/09

Boaz Harrosh, on 07/19/2009 03:35 PM wrote:
> On 07/16/2009 09:15 PM, Vladislav Bolkhovitin wrote:
>> Boaz Harrosh, on 07/15/2009 01:07 PM wrote:
>>> On 07/14/2009 07:17 PM, Vladislav Bolkhovitin wrote:
> <snip>
>>>> +struct blk_kern_sg_hdr {
>>>> + struct scatterlist *orig_sgp;
>>>> + union {
>>>> + struct sg_table new_sg_table;
>>>> + struct scatterlist *saved_sg;
>>>> + };
>>> There is a struct scatterlist * inside struct sg_table
>>> just use that. No need for a union. (It's not like your saving
>>> space). In the tail case nents == 1.
>>> (orig_nents==0 and loose the tail_only)
>> This will uncover internal details of SG-chaining implementation, which
>> you wanted to hide. Or didn't?
>>
>
> Are you commenting on the remove of tail_only, or the reuse of
> ->new_sg_table.sgl instead of ->saved_sg. The later makes tons of sense
> to me.
>
> if you want to keep the "tail_only" and not hack with nents && orig_nents
> that's fine.

Yes, this is exactly what's done.

>>>> + while (hbio != NULL) {
>>>> + bio = hbio;
>>>> + hbio = hbio->bi_next;
>>>> + bio->bi_next = NULL;
>>>> +
>>>> + blk_queue_bounce(q, &bio);
>>> I do not understand. If you are bouncing on the bio level.
>>> why do you need to do all the alignment checks and
>>> sg-bouncing + tail handling all this is done again on the bio
>>> level.
>> blk_queue_bounce() does another and completely independent level of
>> bouncing for pages allocated in the not accessible by the device area.
>>
>
> No! you miss-understood. blk_queue_bounce() does all the bouncing,
> high-mem, alignment, padding, draining, all of them.

Sorry, Boaz, but looks like it is you who don't have the whole picture.
See the end of my previous e-mail.

> The code you took from Tejun was meant to go at the bio level. Here
> you are already taken care of.
>
> All you need to do is:
> loop on all pages
> add_pc_page();
> when bio is full
> chain a new bio
> and so on.

Isn't it what __blk_rq_map_kern_sg() does?

> Then at the end call blk_make_request() which will do the bouncing and
> the merging. No need for all the copy/tail_only etc.. this is done by
> lower levels for you already.

It is _not_ done. See the end of my previous e-mail.

> If you do that then Tejun is right you should do an:
> bio_map_sg()
> and not:
> blk_map_sg()
>
>>> It looks to me that perhaps you did not understand Tejun's code
>>> His code, as I remember, was on the bio level, but here you
>>> are duplicating what is done in bio level.
>>>
>>> But maybe I don't understand. Tejun?
>>>
>>>> + req->cmd_len = cmd_len;
>>>> + memset(req->cmd, 0, BLK_MAX_CDB); /* ATAPI hates garbage after CDB */
>>>> + memcpy(req->cmd, cmd, req->cmd_len);
>>> Does not support large commands.
>> Will be fixed.
>>
>> (BTW, the SCSI layer assumes large CDBs as single big buffers, but all
>> SCSI transports I checked transfer large CDBs in 2 different parts: the
>> first 16 bytes and the rest. I.e. they deal with 2 different buffers.
>> So, the target side (SCST) deals with 2 buffers for large CDBs as well.
>> Having only one req->cmd will force SCST to merge those 2 buffers into a
>> single buffer.
>> So, scs[i,t]_execute_async() will have to make an
>> allocation for this as well.)
>>
>
> Yep, allocation of command buffer if command_size > 16

...which in the majority of cases (if not in all, very possibly) isn't
needed and can be avoided, because the low level drivers then will split
it again on 16 bytes and the rest to deliver to the target. But it isn't
related to this our discussion, except as a one more reason to have the
scsi_io_context allocation.

It can be handled without nents by use of copy_len.

Boaz Harrosh

unread,

Jul 21, 2009, 4:10:10 AM7/21/09

Bring them on. I'm ready

>> Yes that's what I want, and if you can not save some of it's complexity
>> by, for example, merging of scsi_io_context with the request structure you
>> already have at scst level, or save the double-chained-callbacks at the end,
>> then you've done something wrong.
>
> Are you sure??
>
> As I already wrote, in SCST sense allocated on-demand only. It is done
> for performance reasons, because the sense is needed _very_ rarely and
> the buffer for it is quite big, so it's better to save those allocating,
> zeroing and CPU cache trashing. The fact that Linux SCSI ML requires
> sense buffer preallocation is for me something rather to be fixed.
> Pass-through SCST backend is used relatively rarely as well, so it's
> better to keep the sense preallocation for it in the separate code path.
> Then, if there is the sense preallocation, it doesn't matter to alloc
> with it few bytes more.
>
> Is it getting clearer now?
>

Yes it is now clear to me that scst is a mess. you cannot have more then
can_queue commands in-filght. 256 allocated mem_cache buffers of size 64-bytes
or size 96+64 bytes does not matter performance. It's the number of allocations
that count not the size of allocation, usually. If you want you can have a bigger
mem_cache in the scsi-pass-through case to hold a special-with-sense scst-command-
structure.

All drivers that care about sense find a way.

BTW you can call blk_execute_nowait without a sense buffer at all, it is optional.

>>> Does flag SCSI_ASYNC_EXEC_FLAG_AT_HEAD have nothing to do to SCSI, really?
>>>
>> Exactly, it's just a parameter at the call to blk_execute_nowait() with it's
>> own defines already. Nothing SCSI.
>
> Well, going this way we can go too far.. Like, to assertion that SCSI
> HEAD OF QUEUE task attribute is a property of the Linux block layer as well.
>

No "SCSI HEAD OF QUEUE task attribute" is a scsi attribute, none-scsi block
devices will not understand what it is. It is executed by the scsi-device.
Your flag will work for all block devices exactly because it is a parameter
to the block layer.

If you are implementing the scsi protocol to the letter then you must make sure
all subsystems comply for example pass the proper flag to the Linux block subsystem.

Hell look at your own code. And answer a simple question what does
SCSI_ASYNC_EXEC_FLAG_AT_HEAD do, and how does it do it:

+ blk_execute_rq_nowait(req->q, NULL, req,
+ flags & SCSI_ASYNC_EXEC_FLAG_AT_HEAD, scsi_end_async);

I'd say it's just a bit carrier to a boolean parameter at the call to
blk_execute_rq_nowait

>>>>> The allocation of scsi_io_context is unavoidable, because sense buffer should
>>>>> be allocated anyway.
>>>>>
>>>> NO!, driver needs to do it's job and provide a sense buffer and call blk API
>>>> just like all the other drivers. this has nothing to do with SCSI.
>>> What job should a driver do? As I wrote, for SCST allocation of the
>>> sense buffer is necessary in any case (in SCST sense is allocated on
>>> demand only), so it doesn't matter to allocate few bytes more, if it
>>> allows to hide unneeded low level details. The same, I guess, should be
>>> true for all other possible users of this functionality.
>>>
>> Easiest, as all users do is: If you already have your scst request or
>> command, or what ever you call it. So add a sense_buffer array embedded
>> there, end of story no allocation needed.
>>
>> How can you be a scsi-target and return the proper errors upstream,
>> without digging into sense-buffer and error returns to produce the
>> correct report.
>
> Most SCST backends don't deal with senses at all and generate them only
> at the latest stage of the commands processing to deliver errors (e.g.,
> file read errors) to initiators.
>

Sense at the block-layer is optional just don't use it if you don't need
it.

It's there look harder

> I don't see such places. Instead, I see that all users of the block API,
> who cares about the alignment (sg, st, sr, etc.), directly or indirectly
> take care about it, by, e.g., switching to copying functions, before
> calling blk_queue_bounce(). See blk_rq_map_user_iov(), for instance.
> This is exactly what I implemented: handling of misaligned buffers on
> the layer upper blk_queue_bounce().
>

Read the code again, I agree it is a grate mess but ...

For example what about filesystems they just put buffers on a bio and call
generic_make_request. sg st sr all do grate stupid things because of historical
reasons.

See the old scsi_execute_async of scsi_lib.c where was that alignment and
padding? there was not! did it work?

Look at blk_map_kern, where is alignment and padding handling? does it work?

> Thanks,
> Vlad
>
>

Boaz

Vladislav Bolkhovitin

unread,

Jul 21, 2009, 2:30:09 PM7/21/09

Boaz Harrosh, on 07/21/2009 12:03 PM wrote:
>>> Yes that's what I want, and if you can not save some of it's complexity
>>> by, for example, merging of scsi_io_context with the request structure you
>>> already have at scst level, or save the double-chained-callbacks at the end,
>>> then you've done something wrong.
>> Are you sure??
>>
>> As I already wrote, in SCST sense allocated on-demand only. It is done
>> for performance reasons, because the sense is needed _very_ rarely and
>> the buffer for it is quite big, so it's better to save those allocating,
>> zeroing and CPU cache trashing. The fact that Linux SCSI ML requires
>> sense buffer preallocation is for me something rather to be fixed.
>> Pass-through SCST backend is used relatively rarely as well, so it's
>> better to keep the sense preallocation for it in the separate code path.
>> Then, if there is the sense preallocation, it doesn't matter to alloc
>> with it few bytes more.
>>
>> Is it getting clearer now?
>
> Yes it is now clear to me that scst is a mess.

Sounds impressive. But how much this claim related to the reality? Have
you even looked at the SCST code?

> you cannot have more then
> can_queue commands in-filght. 256 allocated mem_cache buffers of size 64-bytes
> or size 96+64 bytes does not matter performance. It's the number of allocations
> that count not the size of allocation, usually. If you want you can have a bigger
> mem_cache in the scsi-pass-through case to hold a special-with-sense scst-command-
> structure.

Just FYI: (1) SCST from the very beginning has been designed and
implemented to have the best possible performance, (2) InfiniBand
already have 1 mks latency and this doesn't seem to be a limit, (3) the
maximum sense size defined by SPC is 252 bytes, not 96.

> All drivers that care about sense find a way.
>
> BTW you can call blk_execute_nowait without a sense buffer at all, it is optional.

And so? To lost this sense, if there is one? Would you do so yourself?

Boaz, what are you doing? Are we discussing a particular implementation
problem or are you trying by any means to prove how stupid and ignorant
I am? Let's return to the problem and stop doing nonsense claims!

>>>> Does flag SCSI_ASYNC_EXEC_FLAG_AT_HEAD have nothing to do to SCSI, really?
>>>>
>>> Exactly, it's just a parameter at the call to blk_execute_nowait() with it's
>>> own defines already. Nothing SCSI.
>> Well, going this way we can go too far.. Like, to assertion that SCSI
>> HEAD OF QUEUE task attribute is a property of the Linux block layer as well.
>>
>
> No "SCSI HEAD OF QUEUE task attribute" is a scsi attribute, none-scsi block
> devices will not understand what it is. It is executed by the scsi-device.
> Your flag will work for all block devices exactly because it is a parameter
> to the block layer.
>
> If you are implementing the scsi protocol to the letter then you must make sure
> all subsystems comply for example pass the proper flag to the Linux block subsystem.

Again rather a nonsense claim. Of course, in the absolute for full
support of HEAD OF QUEUE attribute blk should be modified to be aware of
it to keep the corresponding requests in the head of the queue as well
as to keep HEAD OF QUEUE attribute set for the corresponding SCSI
commands. But, from practical POV, if there is a big stuck queue,
inserting a HEAD OF QUEUE request in the head of it is a _lot_ better,
than in the tail.

> Hell look at your own code. And answer a simple question what does
> SCSI_ASYNC_EXEC_FLAG_AT_HEAD do, and how does it do it:
>
> + blk_execute_rq_nowait(req->q, NULL, req,
> + flags & SCSI_ASYNC_EXEC_FLAG_AT_HEAD, scsi_end_async);
>
> I'd say it's just a bit carrier to a boolean parameter at the call to
> blk_execute_rq_nowait

And so? Does it let us have a good approximation to what we need for
HEAD OF QUEUE commands? Yes, it does. This is a regular practice to pass
flags down to a call stack and it's used in a million of places.

Could you point to the exact lines of code, please?

>> I don't see such places. Instead, I see that all users of the block API,
>> who cares about the alignment (sg, st, sr, etc.), directly or indirectly
>> take care about it, by, e.g., switching to copying functions, before
>> calling blk_queue_bounce(). See blk_rq_map_user_iov(), for instance.
>> This is exactly what I implemented: handling of misaligned buffers on
>> the layer upper blk_queue_bounce().
>>
>
> Read the code again, I agree it is a grate mess but ...
>
> For example what about filesystems they just put buffers on a bio and call
> generic_make_request. sg st sr all do grate stupid things because of historical
> reasons.
>
> See the old scsi_execute_async of scsi_lib.c where was that alignment and
> padding? there was not! did it work?
>
> Look at blk_map_kern, where is alignment and padding handling? does it work?

Of course, it works. From blk_rq_map_kern() from 2.6.30.1:

do_copy = !blk_rq_aligned(q, kbuf, len) || object_is_on_stack(kbuf);
if (do_copy)
bio = bio_copy_kern(q, kbuf, len, gfp_mask, reading);
else
bio = bio_map_kern(q, kbuf, len, gfp_mask);
...
blk_queue_bounce(q, &rq->bio);
...
return 0;

Vlad

Vladislav Bolkhovitin

unread,

Aug 12, 2009, 1:50:12 PM8/12/09

This patch implements function blk_rq_map_kern_sg(), which allows to map
a kernel-originated SG vector to a block request. It is necessary to execute
SCSI commands with from kernel going SG buffer. At the moment SCST is the only

user of this functionality. It needs it, because its target drivers, which
are, basically, SCSI drivers, can deal only with SGs, not with BIOs. But,

according to the latest discussions, there can be other potential users for of

this functionality, so I'm sending this patch in a hope that it will be
also useful for them and eventually will be merged in the mainline kernel.

In the previous submissions this patch was called "New implementation of
scsi_execute_async()", but since in this version scsi_execute_async() was
removed from it by request of Boaz Harrosh the name was changed accordingly.

Highlight of this implementations:

- It uses BIOs chaining instead of kmalloc()'ing the whole bio.

- It uses SGs chaining instead of kmalloc()'ing one big SG in case if direct

mapping failed (e.g. because of DMA alignment or padding).

- When needed, copy_page() is used instead of memcpy() to copy the whole pages faster.

Also this patch adds and exports function sg_copy(), which copies one SG to
another.

Most of the changes since v2 were made based on the feedback from Boaz Harrosh
and Joe Eykholt (thanks!).

Change log since v2:

- scsi_execute_async() was moved from this patch inside SCST core (and gained
support for large CDBs and bidirectional commands (compile tested only))

- With help from some reference counting magic blk_rq_map_kern_sg() doesn't use
rq->end_io_data anymore and uses bio->bi_private instead.

- As a direct consequence of the bio->bi_private usage, the tail copy logic had to
be removed to not conflict with blk_queue_bounce(), which also uses it.

- In sg_copy() new argument nents_to_copy was added

- Return value of sg_copy() was fixed

- This patch doesn't require anymore previously sent patch with subject

"[PATCH]: Rename REQ_COPY_USER to more descriptive REQ_HAS_TAIL_SPACE_FOR_PADDING".

It's against 2.6.30.1, but if necessary, can be updated to any necessary
kernel version.

Signed-off-by: Vladislav Bolkhovitin <v...@vlnb.net>

block/blk-map.c | 331 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/blkdev.h | 3
include/linux/scatterlist.h | 5
lib/scatterlist.c | 129 +++++++++++++++++
4 files changed, 468 insertions(+)

diff -upkr linux-2.6.30.1/block/blk-map.c linux-2.6.30.1/block/blk-map.c

--- linux-2.6.30.1/block/blk-map.c 2009-07-10 21:13:25.000000000 +0400

+++ linux-2.6.30.1/block/blk-map.c 2009-08-12 20:01:43.000000000 +0400

@@ -5,6 +5,7 @@
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
+#include <linux/scatterlist.h>
#include <scsi/sg.h> /* for struct sg_iovec */

#include "blk.h"

@@ -272,6 +273,336 @@ int blk_rq_unmap_user(struct bio *bio)
}
EXPORT_SYMBOL(blk_rq_unmap_user);

+struct blk_kern_sg_work {
+ atomic_t bios_inflight;
+ struct sg_table sg_table;
+ struct scatterlist *src_sgl;
+};
+
+static void blk_free_kern_sg_work(struct blk_kern_sg_work *bw)
+{
+ sg_free_table(&bw->sg_table);
+ kfree(bw);
+ return;
+}
+
+static void blk_bio_map_kern_endio(struct bio *bio, int err)
+{
+ struct blk_kern_sg_work *bw = bio->bi_private;
+
+ if (bw != NULL) {
+ /* Decrement the bios in processing and, if zero, free */
+ BUG_ON(atomic_read(&bw->bios_inflight) <= 0);
+ if (atomic_dec_and_test(&bw->bios_inflight)) {
+ if ((bio_data_dir(bio) == READ) && (err == 0)) {
+ unsigned long flags;
+
+ local_irq_save(flags); /* to protect KMs */
+ sg_copy(bw->src_sgl, bw->sg_table.sgl, 0, 0,
+ KM_BIO_DST_IRQ, KM_BIO_SRC_IRQ);
+ local_irq_restore(flags);
+ }
+ blk_free_kern_sg_work(bw);
+ }
+ }
+
+ bio_put(bio);
+ return;
+}
+
+static int blk_rq_copy_kern_sg(struct request *rq, struct scatterlist *sgl,
+ int nents, struct blk_kern_sg_work **pbw,

+ gfp_t gfp, gfp_t page_gfp)
+{

+ int res = 0, i;

+ struct scatterlist *sg;
+ struct scatterlist *new_sgl;
+ int new_sgl_nents;

+ size_t len = 0, to_copy;

+ struct blk_kern_sg_work *bw;
+
+ bw = kzalloc(sizeof(*bw), gfp);
+ if (bw == NULL)
+ goto out;
+
+ bw->src_sgl = sgl;

+
+ for_each_sg(sgl, sg, nents, i)
+ len += sg->length;
+ to_copy = len;
+

+ new_sgl_nents = PFN_UP(len);
+
+ res = sg_alloc_table(&bw->sg_table, new_sgl_nents, gfp);

+ if (res != 0)

+ goto out_free_bw;
+
+ new_sgl = bw->sg_table.sgl;

+ * sgl might have the last element in sgl not marked as last in

+ * SG chaining.
+ */

+ sg_copy(new_sgl, sgl, 0, to_copy,
+ KM_USER0, KM_USER1);
+ }
+
+ *pbw = bw;
+ /*
+ * REQ_COPY_USER name is misleading. It should be something like
+ * REQ_HAS_TAIL_SPACE_FOR_PADDING.
+ */
+ rq->cmd_flags |= REQ_COPY_USER;

+
+out:
+ return res;
+
+err_free_new_sgl:
+ for_each_sg(new_sgl, sg, new_sgl_nents, i) {
+ struct page *pg = sg_page(sg);
+ if (pg == NULL)
+ break;
+ __free_page(pg);
+ }

+ sg_free_table(&bw->sg_table);
+
+out_free_bw:
+ kfree(bw);

+ res = -ENOMEM;
+ goto out;
+}
+

+static int __blk_rq_map_kern_sg(struct request *rq, struct scatterlist *sgl,

+ int nents, struct blk_kern_sg_work *bw, gfp_t gfp)
+{
+ int res;

+ struct request_queue *q = rq->q;
+ int rw = rq_data_dir(rq);
+ int max_nr_vecs, i;
+ size_t tot_len;
+ bool need_new_bio;
+ struct scatterlist *sg, *prev_sg = NULL;
+ struct bio *bio = NULL, *hbio = NULL, *tbio = NULL;

+ int bios;
+
+ if (unlikely((sgl == NULL) || (sgl->length == 0) || (nents <= 0))) {

+ WARN_ON(1);
+ res = -EINVAL;
+ goto out;
+ }
+
+ /*
+ * Let's keep each bio allocation inside a single page to decrease
+ * probability of failure.
+ */
+ max_nr_vecs = min_t(size_t,
+ ((PAGE_SIZE - sizeof(struct bio)) / sizeof(struct bio_vec)),
+ BIO_MAX_PAGES);
+
+ need_new_bio = true;
+ tot_len = 0;

+ bios = 0;

+ for_each_sg(sgl, sg, nents, i) {
+ struct page *page = sg_page(sg);
+ void *page_addr = page_address(page);
+ size_t len = sg->length, l;
+ size_t offset = sg->offset;
+
+ tot_len += len;
+ prev_sg = sg;
+
+ /*
+ * Each segment must be aligned on DMA boundary and
+ * not on stack. The last one may have unaligned
+ * length as long as the total length is aligned to
+ * DMA padding alignment.
+ */
+ if (i == nents - 1)
+ l = 0;
+ else
+ l = len;
+ if (((sg->offset | l) & queue_dma_alignment(q)) ||
+ (page_addr && object_is_on_stack(page_addr + sg->offset))) {
+ res = -EINVAL;

+ goto out_free_bios;

+ }
+
+ while (len > 0) {
+ size_t bytes;
+ int rc;
+
+ if (need_new_bio) {
+ bio = bio_kmalloc(gfp, max_nr_vecs);
+ if (bio == NULL) {
+ res = -ENOMEM;
+ goto out_free_bios;
+ }
+
+ if (rw == WRITE)
+ bio->bi_rw |= 1 << BIO_RW;
+

+ bios++;
+ bio->bi_private = bw;
+ bio->bi_end_io = blk_bio_map_kern_endio;

+
+ if (hbio == NULL)
+ hbio = tbio = bio;
+ else
+ tbio = tbio->bi_next = bio;
+ }
+
+ bytes = min_t(size_t, len, PAGE_SIZE - offset);
+
+ rc = bio_add_pc_page(q, bio, page, bytes, offset);
+ if (rc < bytes) {
+ if (unlikely(need_new_bio || (rc < 0))) {
+ if (rc < 0)
+ res = rc;
+ else
+ res = -EIO;

+ goto out_free_bios;

+ } else {
+ need_new_bio = true;
+ len -= rc;
+ offset += rc;
+ continue;
+ }
+ }
+
+ need_new_bio = false;
+ offset = 0;
+ len -= bytes;
+ page = nth_page(page, 1);
+ }
+ }
+
+ if (hbio == NULL) {
+ res = -EINVAL;
+ goto out_free_bios;
+ }
+
+ /* Total length must be aligned on DMA padding alignment */
+ if ((tot_len & q->dma_pad_mask) &&

+ !(rq->cmd_flags & REQ_COPY_USER)) {

+ res = -EINVAL;
+ goto out_free_bios;
+ }
+

+ if (bw != NULL)
+ atomic_set(&bw->bios_inflight, bios);

+
+ while (hbio != NULL) {
+ bio = hbio;
+ hbio = hbio->bi_next;
+ bio->bi_next = NULL;
+
+ blk_queue_bounce(q, &bio);
+
+ res = blk_rq_append_bio(q, rq, bio);
+ if (unlikely(res != 0)) {
+ bio->bi_next = hbio;
+ hbio = bio;

+ /* We can have one or more bios bounced */
+ goto out_unmap_bios;

+ }
+ }
+
+ rq->buffer = rq->data = NULL;

+out:
+ return res;
+

+out_free_bios:
+ while (hbio != NULL) {
+ bio = hbio;
+ hbio = hbio->bi_next;
+ bio_put(bio);
+ }
+ goto out;
+

+out_unmap_bios:
+ blk_rq_unmap_kern_sg(rq, res);

+ goto out;
+}
+
+/**
+ * blk_rq_map_kern_sg - map kernel data to a request, for REQ_TYPE_BLOCK_PC
+ * @rq: request to fill
+ * @sgl: area to map
+ * @nents: number of elements in @sgl
+ * @gfp: memory allocation flags
+ *
+ * Description:
+ * Data will be mapped directly if possible. Otherwise a bounce
+ * buffer will be used.
+ */
+int blk_rq_map_kern_sg(struct request *rq, struct scatterlist *sgl,
+ int nents, gfp_t gfp)
+{
+ int res;
+

+ res = __blk_rq_map_kern_sg(rq, sgl, nents, NULL, gfp);

+ if (unlikely(res != 0)) {

+ struct blk_kern_sg_work *bw = NULL;
+
+ res = blk_rq_copy_kern_sg(rq, sgl, nents, &bw,
+ gfp, rq->q->bounce_gfp | gfp);

+ if (unlikely(res != 0))
+ goto out;
+

+ res = __blk_rq_map_kern_sg(rq, bw->sg_table.sgl,
+ bw->sg_table.nents, bw, gfp);

+ if (res != 0) {

+ blk_free_kern_sg_work(bw);

+ goto out;
+ }
+ }
+
+ rq->buffer = rq->data = NULL;
+
+out:
+ return res;
+}
+EXPORT_SYMBOL(blk_rq_map_kern_sg);
+
+/**

+ * blk_rq_unmap_kern_sg - unmap a request with kernel sg
+ * @rq: request to unmap
+ * @err: non-zero error code

+ *
+ * Description:

+ * Unmap a rq previously mapped by blk_rq_map_kern_sg(). Must be called
+ * only in case of an error!
+ */
+void blk_rq_unmap_kern_sg(struct request *rq, int err)
+{
+ struct bio *bio = rq->bio;
+
+ while (bio) {
+ struct bio *b = bio;
+ bio = bio->bi_next;
+ b->bi_end_io(b, err);
+ }
+ rq->bio = 0;
+

+ return;
+}
+EXPORT_SYMBOL(blk_rq_unmap_kern_sg);
+

/**
* blk_rq_map_kern - map kernel data to a request, for REQ_TYPE_BLOCK_PC usage
* @q: request queue where request should be inserted

diff -upkr linux-2.6.30.1/include/linux/blkdev.h linux-2.6.30.1/include/linux/blkdev.h
--- linux-2.6.30.1/include/linux/blkdev.h 2009-07-10 21:13:25.000000000 +0400
+++ linux-2.6.30.1/include/linux/blkdev.h 2009-08-12 19:48:06.000000000 +0400
@@ -807,6 +809,9 @@ extern int blk_rq_map_kern(struct reques

extern int blk_rq_map_user_iov(struct request_queue *, struct request *,
struct rq_map_data *, struct sg_iovec *, int,
unsigned int, gfp_t);

+extern int blk_rq_map_kern_sg(struct request *rq, struct scatterlist *sgl,
+ int nents, gfp_t gfp);
+extern void blk_rq_unmap_kern_sg(struct request *rq, int err);

extern int blk_execute_rq(struct request_queue *, struct gendisk *,
struct request *, int);
extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,

diff -upkr linux-2.6.30.1/include/linux/scatterlist.h linux-2.6.30.1/include/linux/scatterlist.h
--- linux-2.6.30.1/include/linux/scatterlist.h 2009-06-10 07:05:27.000000000 +0400
+++ linux-2.6.30.1/include/linux/scatterlist.h 2009-08-12 19:50:02.000000000 +0400
@@ -3,6 +3,7 @@

#include <asm/types.h>
#include <asm/scatterlist.h>
+#include <asm/kmap_types.h>
#include <linux/mm.h>
#include <linux/string.h>
#include <asm/io.h>
@@ -218,6 +219,10 @@ size_t sg_copy_from_buffer(struct scatte

size_t sg_copy_to_buffer(struct scatterlist *sgl, unsigned int nents,
void *buf, size_t buflen);

+int sg_copy(struct scatterlist *dst_sg, struct scatterlist *src_sg,
+ int nents_to_copy, size_t copy_len,

+ enum km_type d_km_type, enum km_type s_km_type);
+

/*
* Maximum number of entries that will be allocated in one piece, if
* a list larger than this is required then chaining will be utilized.

return sg_copy_buffer(sgl, nents, buf, buflen, 1);
}
EXPORT_SYMBOL(sg_copy_to_buffer);
+
+/*

+ * Can switch to the next dst_sg element, so, to copy to strictly only
+ * one dst_sg element, it must be either last in the chain, or
+ * copy_len == dst_sg->length.
+ */
+static int sg_copy_elem(struct scatterlist **pdst_sg, size_t *pdst_len,

+ size_t *pdst_offs, struct scatterlist *src_sg,

+ size_t copy_len,

+ enum km_type d_km_type, enum km_type s_km_type)
+{
+ int res = 0;

+ struct scatterlist *dst_sg;

+ size_t src_len, dst_len, src_offs, dst_offs;
+ struct page *src_page, *dst_page;
+

+ dst_sg = *pdst_sg;
+ dst_len = *pdst_len;
+ dst_offs = *pdst_offs;

+ dst_page = sg_page(dst_sg);
+

+ src_page = sg_page(src_sg);
+ src_len = src_sg->length;
+ src_offs = src_sg->offset;
+
+ do {
+ void *saddr, *daddr;
+ size_t n;
+

+ saddr = kmap_atomic(src_page +
+ (src_offs >> PAGE_SHIFT), s_km_type) +
+ (src_offs & ~PAGE_MASK);
+ daddr = kmap_atomic(dst_page +
+ (dst_offs >> PAGE_SHIFT), d_km_type) +
+ (dst_offs & ~PAGE_MASK);
+
+ if (((src_offs & ~PAGE_MASK) == 0) &&

+ ((dst_offs & ~PAGE_MASK) == 0) &&

+ (src_len >= PAGE_SIZE) && (dst_len >= PAGE_SIZE) &&
+ (copy_len >= PAGE_SIZE)) {
+ copy_page(daddr, saddr);
+ n = PAGE_SIZE;
+ } else {

+ n = min_t(size_t, PAGE_SIZE - (dst_offs & ~PAGE_MASK),
+ PAGE_SIZE - (src_offs & ~PAGE_MASK));

+out:
+ *pdst_sg = dst_sg;
+ *pdst_len = dst_len;
+ *pdst_offs = dst_offs;
+ return res;

+}
+
+/**
+ * sg_copy - copy one SG vector to another

+ * @dst_sg: destination SG
+ * @src_sg: source SG

+ * @nents_to_copy: maximum number of entries to copy

+ * @copy_len: maximum amount of data to copy. If 0, then copy all.
+ * @d_km_type: kmap_atomic type for the destination SG
+ * @s_km_type: kmap_atomic type for the source SG
+ *
+ * Description:

+ * Data from the source SG vector will be copied to the destination SG

+ * vector. End of the vectors will be determined by sg_next() returning
+ * NULL. Returns number of bytes copied.
+ */

+int sg_copy(struct scatterlist *dst_sg, struct scatterlist *src_sg,
+ int nents_to_copy, size_t copy_len,

+ enum km_type d_km_type, enum km_type s_km_type)
+{
+ int res = 0;

+ size_t dst_len, dst_offs;

+
+ if (copy_len == 0)
+ copy_len = 0x7FFFFFFF; /* copy all */
+

+ if (nents_to_copy == 0)
+ nents_to_copy = 0x7FFFFFFF; /* copy all */
+

+ dst_len = dst_sg->length;
+ dst_offs = dst_sg->offset;
+

+ do {
+ int copied = sg_copy_elem(&dst_sg, &dst_len, &dst_offs,

+ src_sg, copy_len, d_km_type, s_km_type);

+ copy_len -= copied;
+ res += copied;

+ if ((copy_len == 0) || (dst_sg == NULL))

+ goto out;
+
+ nents_to_copy--;
+ if (nents_to_copy == 0)
+ goto out;
+

+ src_sg = sg_next(src_sg);
+ } while (src_sg != NULL);
+
+out:
+ return res;
+}

+EXPORT_SYMBOL(sg_copy);

Jens Axboe

unread,

Aug 15, 2009, 4:30:06 AM8/15/09

On Wed, Aug 12 2009, Vladislav Bolkhovitin wrote:
> This patch implements function blk_rq_map_kern_sg(), which allows to map
> a kernel-originated SG vector to a block request. It is necessary to execute
> SCSI commands with from kernel going SG buffer. At the moment SCST is the only
> user of this functionality. It needs it, because its target drivers, which
> are, basically, SCSI drivers, can deal only with SGs, not with BIOs. But,
> according to the latest discussions, there can be other potential users for of
> this functionality, so I'm sending this patch in a hope that it will be
> also useful for them and eventually will be merged in the mainline kernel.
>
> In the previous submissions this patch was called "New implementation of
> scsi_execute_async()", but since in this version scsi_execute_async() was
> removed from it by request of Boaz Harrosh the name was changed accordingly.

Generally this patch looks great, I just have one little thing I'd like
to point out:

> + while (hbio != NULL) {
> + bio = hbio;
> + hbio = hbio->bi_next;
> + bio->bi_next = NULL;
> +
> + blk_queue_bounce(q, &bio);
> +
> + res = blk_rq_append_bio(q, rq, bio);
> + if (unlikely(res != 0)) {
> + bio->bi_next = hbio;
> + hbio = bio;
> + /* We can have one or more bios bounced */
> + goto out_unmap_bios;
> + }
> + }

Constructs like this are always dangerous, because of how mempools work.
__blk_queue_bounce() will internally do:

bio = bio_alloc(GFP_NOIO, cnt);

so you could potentially enter a deadlock if a) you are the only one
allocating a bio currently, and b) the alloc fails and we wait for a bio
to be returned to the pool. This is highly unlikely and requires other
conditions to be dire, but it is a problem. This is not restricted to
the swap out path, the problem is purely lack of progress. So the golden
rule is always that you either allocate these units from a private pool
(which is hard for bouncing, since it does both page and bio allocations
from a mempool), or that you always ensure that a previously allocated
bio is in flight before attempting a new alloc.

--
Jens Axboe

Vladislav Bolkhovitin

unread,

Sep 3, 2009, 12:40:13 PM9/3/09

Sorry for the late reply, I was on vacation.

I see your concerns. Since all the bios in __blk_rq_map_kern_sg() at
first all allocated and only then submitted for I/O, bio_alloc() in
__blk_queue_bounce() potentially can deadlock, if it's called with
GFP_NOIO (i.e. with __GFP_WAIT) and its mempool gets empty. The fact
that __blk_rq_map_kern_sg() allocates originally bios using
bio_kmalloc() doesn't fundamentally change that, only low the failure
probability. (Just to make sure I understand everything correctly.)

Potentially this can be a problem, since SCST nearly always uses
GFP_KERNEL as the mask, i.e. has __GFP_WAIT set, although, I agree, the
deadlock is very unlikely.

To address it and other similar cases, which, I guess, should exist, I
see the following 2 ways:

1. Increase BIO_POOL_SIZE from current 2 to a bigger value to be large
enough to satisfy such full requests allocations for the maximum
requests. In ideal, for the worst case it should be something like for
2MB * NR_CPUS much data, which is 2MB / (BIO_MAX_PAGES * PAGE_SIZE) *
NR_CPUS = 2NR_CPUS with 4K pages. But on practice, possibly something
like 10-20 should be sufficient?

2. Modify blk_queue_bounce() that it can fail with bounce buffers
allocation and graciously process that in __blk_rq_map_kern_sg() and all
other similar places.

Which way would you prefer? Or do you think the probability for such
deadlock is so low, so it doesn't worth the effort to do anything with it?

Thanks a lot for review!
Vlad

0 new messages