
[PATCH 3/5] target: Update percpu_ida_alloc to use task state bitmask


Nicholas A. Bellinger

Jan 19, 2014, 5:40:02 AM
From: Nicholas Bellinger <n...@linux-iscsi.org>

This patch updates the percpu_ida_alloc() usage in target/vhost to pass
TASK_RUNNING (cannot sleep) instead of GFP_ATOMIC, and updates
iscsi-target to pass TASK_INTERRUPTIBLE (may sleep) or TASK_RUNNING
(cannot sleep).

Both follow the changes to percpu_ida that allow a task state bitmask
to be passed in by the caller.

Reported-by: Linus Torvalds <torv...@linux-foundation.org>
Cc: <sta...@vger.kernel.org> #3.12+
Signed-off-by: Nicholas Bellinger <n...@linux-iscsi.org>
---
drivers/target/iscsi/iscsi_target_util.c | 8 ++++++--
drivers/vhost/scsi.c | 2 +-
2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 0819e68..5477eca 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -156,9 +156,13 @@ struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *conn, gfp_t gfp_mask)
{
struct iscsi_cmd *cmd;
struct se_session *se_sess = conn->sess->se_sess;
- int size, tag;
+ int size, tag, state = (gfp_mask & __GFP_WAIT) ? TASK_INTERRUPTIBLE :
+ TASK_RUNNING;
+
+ tag = percpu_ida_alloc(&se_sess->sess_tag_pool, state);
+ if (tag < 0)
+ return NULL;

- tag = percpu_ida_alloc(&se_sess->sess_tag_pool, gfp_mask);
size = sizeof(struct iscsi_cmd) + conn->conn_transport->priv_size;
cmd = (struct iscsi_cmd *)(se_sess->sess_cmd_map + (tag * size));
memset(cmd, 0, size);
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 84488a8..2d084fb 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -728,7 +728,7 @@ vhost_scsi_get_tag(struct vhost_virtqueue *vq,
}
se_sess = tv_nexus->tvn_se_sess;

- tag = percpu_ida_alloc(&se_sess->sess_tag_pool, GFP_ATOMIC);
+ tag = percpu_ida_alloc(&se_sess->sess_tag_pool, TASK_RUNNING);
if (tag < 0) {
pr_err("Unable to obtain tag for tcm_vhost_cmd\n");
return ERR_PTR(-ENOMEM);
--
1.7.10.4


Nicholas A. Bellinger

Jan 19, 2014, 5:40:02 AM
From: Nicholas Bellinger <n...@linux-iscsi.org>

Hello Linus,

Here is the series for converting percpu_ida_alloc() + consumer
usage to accept the task state bitmask parameter, w/o the extra
legacy gfp_t wrapper.

Also, given that the first three patches need to be backported
to different kernel versions, the series has been broken up into
separate commits for blk-mq (v3.13.y) and target/* (v3.12.y) with
the minimal set of changes marked for stable.

The last two patches propagate the task state bitmask usage further
up the blk-mq + target/iscsi stack; as improvements, they are not
marked for stable.

CC'ing Ingo + Peter for TASK_RUNNING + prepare_to_wait() bit.

Thank you,

--nab

Kent Overstreet (1):
percpu_ida: Make percpu_ida_alloc accept task state bitmask

Nicholas Bellinger (4):
blk-mq: Update percpu_ida_alloc to use task state bitmask
target: Update percpu_ida_alloc to use task state bitmask
blk-mq: Convert gfp_t parameters to task state bitmask
target/iscsi: Convert gfp_t parameter to task state bitmask

block/blk-core.c | 4 +++-
block/blk-flush.c | 2 +-
block/blk-mq-tag.c | 14 +++++++-------
block/blk-mq-tag.h | 2 +-
block/blk-mq.c | 28 ++++++++++++++--------------
drivers/infiniband/ulp/isert/ib_isert.c | 14 +++++++-------
drivers/target/iscsi/iscsi_target.c | 14 +++++++-------
drivers/target/iscsi/iscsi_target_util.c | 9 ++++++---
drivers/target/iscsi/iscsi_target_util.h | 2 +-
drivers/vhost/scsi.c | 2 +-
include/linux/blk-mq.h | 4 ++--
include/linux/percpu_ida.h | 3 ++-
include/target/iscsi/iscsi_transport.h | 2 +-
lib/percpu_ida.c | 17 +++++++++++------
14 files changed, 64 insertions(+), 53 deletions(-)

Nicholas A. Bellinger

Jan 19, 2014, 5:40:02 AM
From: Kent Overstreet <k...@daterainc.com>

This patch changes percpu_ida_alloc() to accept a task state bitmask
for prepare_to_wait(), to support interruptible sleep for callers
that require it.

Previously, this code assumed uninterruptible sleep in all cases,
thus preventing signals from being delivered to a scheduled process
context in the tag stealing slow path.

It now accepts TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE when the
caller is able to sleep waiting for a new tag, or TASK_RUNNING when
the caller cannot sleep, and is forced to return a negative tag.
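
For illustration (not part of the patch itself), a caller-side sketch
of these semantics, using a hypothetical helper and pool:

static int example_get_tag(struct percpu_ida *pool, bool can_sleep)
{
	int tag;

	if (can_sleep)
		/* Sleeps until another thread frees a tag. */
		tag = percpu_ida_alloc(pool, TASK_UNINTERRUPTIBLE);
	else
		/* Never sleeps; returns a negative value if the pool is empty. */
		tag = percpu_ida_alloc(pool, TASK_RUNNING);

	return tag;	/* callers must handle tag < 0 in the non-sleeping case */
}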

Reported-by: Linus Torvalds <torv...@linux-foundation.org>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Kent Overstreet <k...@daterainc.com>
Cc: <sta...@vger.kernel.org> #3.12+
Signed-off-by: Nicholas Bellinger <n...@linux-iscsi.org>
---
include/linux/percpu_ida.h | 3 ++-
lib/percpu_ida.c | 17 +++++++++++------
2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/linux/percpu_ida.h b/include/linux/percpu_ida.h
index 1900bd0..f5cfdd6 100644
--- a/include/linux/percpu_ida.h
+++ b/include/linux/percpu_ida.h
@@ -4,6 +4,7 @@
#include <linux/types.h>
#include <linux/bitops.h>
#include <linux/init.h>
+#include <linux/sched.h>
#include <linux/spinlock_types.h>
#include <linux/wait.h>
#include <linux/cpumask.h>
@@ -61,7 +62,7 @@ struct percpu_ida {
/* Max size of percpu freelist, */
#define IDA_DEFAULT_PCPU_SIZE ((IDA_DEFAULT_PCPU_BATCH_MOVE * 3) / 2)

-int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp);
+int percpu_ida_alloc(struct percpu_ida *pool, int state);
void percpu_ida_free(struct percpu_ida *pool, unsigned tag);

void percpu_ida_destroy(struct percpu_ida *pool);
diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index 9d054bf..4579749 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -132,22 +132,22 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
/**
* percpu_ida_alloc - allocate a tag
* @pool: pool to allocate from
- * @gfp: gfp flags
+ * @state: task state for prepare_to_wait
*
* Returns a tag - an integer in the range [0..nr_tags) (passed to
* tag_pool_init()), or otherwise -ENOSPC on allocation failure.
*
* Safe to be called from interrupt context (assuming it isn't passed
- * __GFP_WAIT, of course).
+ * TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, of course).
*
* @gfp indicates whether or not to wait until a free id is available (it's not
* used for internal memory allocations); thus if passed __GFP_WAIT we may sleep
* however long it takes until another thread frees an id (same semantics as a
* mempool).
*
- * Will not fail if passed __GFP_WAIT.
+ * Will not fail if passed TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE.
*/
-int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
+int percpu_ida_alloc(struct percpu_ida *pool, int state)
{
DEFINE_WAIT(wait);
struct percpu_ida_cpu *tags;
@@ -174,7 +174,7 @@ int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
*
* global lock held and irqs disabled, don't need percpu lock
*/
- prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
+ prepare_to_wait(&pool->wait, &wait, state);

if (!tags->nr_free)
alloc_global_tags(pool, tags);
@@ -191,9 +191,14 @@ int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
spin_unlock(&pool->lock);
local_irq_restore(flags);

- if (tag >= 0 || !(gfp & __GFP_WAIT))
+ if (tag >= 0 || state == TASK_RUNNING)
break;

+ if (signal_pending_state(state, current)) {
+ tag = -ERESTARTSYS;
+ break;
+ }
+
schedule();

local_irq_save(flags);

Nicholas A. Bellinger

Jan 19, 2014, 5:40:02 AM
From: Nicholas Bellinger <n...@linux-iscsi.org>

This patch propagates the use of a task state bitmask for
percpu_ida_alloc() up the blk-mq callchain, to the point in
blk_get_request() where the blk-mq vs. blk-old split occurs.

Along with the obvious parameter changes, there are two cases
in mq_flush_work() + blk_mq_make_request() where the original
code used __GFP_WAIT|GFP_ATOMIC and always expected a tag;
these have been converted to TASK_UNINTERRUPTIBLE.

Reported-by: Linus Torvalds <torv...@linux-foundation.org>
Cc: Jens Axboe <ax...@kernel.dk>
Signed-off-by: Nicholas Bellinger <n...@linux-iscsi.org>
---
block/blk-core.c | 4 +++-
block/blk-flush.c | 2 +-
block/blk-mq-tag.c | 16 +++++++---------
block/blk-mq-tag.h | 2 +-
block/blk-mq.c | 28 ++++++++++++++--------------
include/linux/blk-mq.h | 4 ++--
6 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8bdd012..ab0dc9a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1112,7 +1112,9 @@ static struct request *blk_old_get_request(struct request_queue *q, int rw,
struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
{
if (q->mq_ops)
- return blk_mq_alloc_request(q, rw, gfp_mask, false);
+ return blk_mq_alloc_request(q, rw, (gfp_mask & __GFP_WAIT) ?
+ TASK_UNINTERRUPTIBLE : TASK_RUNNING,
+ false);
else
return blk_old_get_request(q, rw, gfp_mask);
}
diff --git a/block/blk-flush.c b/block/blk-flush.c
index fb6f3c0..8dd6ff8 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -286,7 +286,7 @@ static void mq_flush_work(struct work_struct *work)

/* We don't need set REQ_FLUSH_SEQ, it's for consistency */
rq = blk_mq_alloc_request(q, WRITE_FLUSH|REQ_FLUSH_SEQ,
- __GFP_WAIT|GFP_ATOMIC, true);
+ TASK_UNINTERRUPTIBLE, true);
rq->cmd_type = REQ_TYPE_FS;
rq->end_io = flush_end_io;

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 5d70edc..20777bd 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -32,19 +32,18 @@ bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
percpu_ida_free_tags(&tags->free_tags, nr_cpu_ids) != 0;
}

-static unsigned int __blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp)
+static unsigned int __blk_mq_get_tag(struct blk_mq_tags *tags, int state)
{
int tag;

- tag = percpu_ida_alloc(&tags->free_tags, (gfp & __GFP_WAIT) ?
- TASK_UNINTERRUPTIBLE : TASK_RUNNING);
+ tag = percpu_ida_alloc(&tags->free_tags, state);
if (tag < 0)
return BLK_MQ_TAG_FAIL;
return tag + tags->nr_reserved_tags;
}

static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_tags *tags,
- gfp_t gfp)
+ int state)
{
int tag;

@@ -53,19 +52,18 @@ static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_tags *tags,
return BLK_MQ_TAG_FAIL;
}

- tag = percpu_ida_alloc(&tags->reserved_tags, (gfp & __GFP_WAIT) ?
- TASK_UNINTERRUPTIBLE : TASK_RUNNING);
+ tag = percpu_ida_alloc(&tags->reserved_tags, state);
if (tag < 0)
return BLK_MQ_TAG_FAIL;
return tag;
}

-unsigned int blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp, bool reserved)
+unsigned int blk_mq_get_tag(struct blk_mq_tags *tags, int state, bool reserved)
{
if (!reserved)
- return __blk_mq_get_tag(tags, gfp);
+ return __blk_mq_get_tag(tags, state);

- return __blk_mq_get_reserved_tag(tags, gfp);
+ return __blk_mq_get_reserved_tag(tags, state);
}

static void __blk_mq_put_tag(struct blk_mq_tags *tags, unsigned int tag)
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 947ba2c..b3c1487 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -6,7 +6,7 @@ struct blk_mq_tags;
extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node);
extern void blk_mq_free_tags(struct blk_mq_tags *tags);

-extern unsigned int blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp, bool reserved);
+extern unsigned int blk_mq_get_tag(struct blk_mq_tags *tags, int state, bool reserved);
extern void blk_mq_wait_for_tags(struct blk_mq_tags *tags);
extern void blk_mq_put_tag(struct blk_mq_tags *tags, unsigned int tag);
extern void blk_mq_tag_busy_iter(struct blk_mq_tags *tags, void (*fn)(void *data, unsigned long *), void *data);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c79126e..80bbfbd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -75,13 +75,13 @@ static void blk_mq_hctx_mark_pending(struct blk_mq_hw_ctx *hctx,
set_bit(ctx->index_hw, hctx->ctx_map);
}

-static struct request *blk_mq_alloc_rq(struct blk_mq_hw_ctx *hctx, gfp_t gfp,
+static struct request *blk_mq_alloc_rq(struct blk_mq_hw_ctx *hctx, int state,
bool reserved)
{
struct request *rq;
unsigned int tag;

- tag = blk_mq_get_tag(hctx->tags, gfp, reserved);
+ tag = blk_mq_get_tag(hctx->tags, state, reserved);
if (tag != BLK_MQ_TAG_FAIL) {
rq = hctx->rqs[tag];
rq->tag = tag;
@@ -183,13 +183,13 @@ static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
}

static struct request *__blk_mq_alloc_request(struct blk_mq_hw_ctx *hctx,
- gfp_t gfp, bool reserved)
+ int state, bool reserved)
{
- return blk_mq_alloc_rq(hctx, gfp, reserved);
+ return blk_mq_alloc_rq(hctx, state, reserved);
}

static struct request *blk_mq_alloc_request_pinned(struct request_queue *q,
- int rw, gfp_t gfp,
+ int rw, int state,
bool reserved)
{
struct request *rq;
@@ -198,14 +198,14 @@ static struct request *blk_mq_alloc_request_pinned(struct request_queue *q,
struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q, ctx->cpu);

- rq = __blk_mq_alloc_request(hctx, gfp & ~__GFP_WAIT, reserved);
+ rq = __blk_mq_alloc_request(hctx, state, reserved);
if (rq) {
blk_mq_rq_ctx_init(q, ctx, rq, rw);
break;
}

blk_mq_put_ctx(ctx);
- if (!(gfp & __GFP_WAIT))
+ if (state == TASK_RUNNING)
break;

__blk_mq_run_hw_queue(hctx);
@@ -216,28 +216,28 @@ static struct request *blk_mq_alloc_request_pinned(struct request_queue *q,
}

struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
- gfp_t gfp, bool reserved)
+ int state, bool reserved)
{
struct request *rq;

if (blk_mq_queue_enter(q))
return NULL;

- rq = blk_mq_alloc_request_pinned(q, rw, gfp, reserved);
+ rq = blk_mq_alloc_request_pinned(q, rw, state, reserved);
if (rq)
blk_mq_put_ctx(rq->mq_ctx);
return rq;
}

struct request *blk_mq_alloc_reserved_request(struct request_queue *q, int rw,
- gfp_t gfp)
+ int state)
{
struct request *rq;

if (blk_mq_queue_enter(q))
return NULL;

- rq = blk_mq_alloc_request_pinned(q, rw, gfp, true);
+ rq = blk_mq_alloc_request_pinned(q, rw, state, true);
if (rq)
blk_mq_put_ctx(rq->mq_ctx);
return rq;
@@ -928,14 +928,14 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
hctx = q->mq_ops->map_queue(q, ctx->cpu);

trace_block_getrq(q, bio, rw);
- rq = __blk_mq_alloc_request(hctx, GFP_ATOMIC, false);
+ rq = __blk_mq_alloc_request(hctx, TASK_RUNNING, false);
if (likely(rq))
blk_mq_rq_ctx_init(q, ctx, rq, rw);
else {
blk_mq_put_ctx(ctx);
trace_block_sleeprq(q, bio, rw);
- rq = blk_mq_alloc_request_pinned(q, rw, __GFP_WAIT|GFP_ATOMIC,
- false);
+ rq = blk_mq_alloc_request_pinned(q, rw, TASK_UNINTERRUPTIBLE,
+ false);
ctx = rq->mq_ctx;
hctx = q->mq_ops->map_queue(q, ctx->cpu);
}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index ab0e9b2..91ee75a 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -124,8 +124,8 @@ void blk_mq_insert_request(struct request_queue *, struct request *, bool);
void blk_mq_run_queues(struct request_queue *q, bool async);
void blk_mq_free_request(struct request *rq);
bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
-struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp, bool reserved);
-struct request *blk_mq_alloc_reserved_request(struct request_queue *q, int rw, gfp_t gfp);
+struct request *blk_mq_alloc_request(struct request_queue *q, int rw, int state, bool reserved);
+struct request *blk_mq_alloc_reserved_request(struct request_queue *q, int rw, int state);
struct request *blk_mq_rq_from_tag(struct request_queue *q, unsigned int tag);

struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *, const int ctx_index);

Nicholas A. Bellinger

Jan 19, 2014, 5:40:03 AM
From: Nicholas Bellinger <n...@linux-iscsi.org>

This patch updates the percpu_ida_alloc() usage within blk-mq tag
allocation code to use TASK_UNINTERRUPTIBLE (may sleep) or
TASK_RUNNING (cannot sleep), following the changes to percpu_ida
that allow a task state bitmask to be passed from the caller.

Reported-by: Linus Torvalds <torv...@linux-foundation.org>
Cc: Jens Axboe <ax...@kernel.dk>
Cc: <sta...@vger.kernel.org> #3.13+
Signed-off-by: Nicholas Bellinger <n...@linux-iscsi.org>
---
block/blk-mq-tag.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d64a02f..5d70edc 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -36,7 +36,8 @@ static unsigned int __blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp)
{
int tag;

- tag = percpu_ida_alloc(&tags->free_tags, gfp);
+ tag = percpu_ida_alloc(&tags->free_tags, (gfp & __GFP_WAIT) ?
+ TASK_UNINTERRUPTIBLE : TASK_RUNNING);
if (tag < 0)
return BLK_MQ_TAG_FAIL;
return tag + tags->nr_reserved_tags;
@@ -52,7 +53,8 @@ static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_tags *tags,
return BLK_MQ_TAG_FAIL;
}

- tag = percpu_ida_alloc(&tags->reserved_tags, gfp);
+ tag = percpu_ida_alloc(&tags->reserved_tags, (gfp & __GFP_WAIT) ?
+ TASK_UNINTERRUPTIBLE : TASK_RUNNING);
if (tag < 0)
return BLK_MQ_TAG_FAIL;
return tag;

Nicholas A. Bellinger

Jan 19, 2014, 5:40:03 AM
From: Nicholas Bellinger <n...@linux-iscsi.org>

This patch propagates the task state bitmask now used by
percpu_ida_alloc() up the iscsi-target callchain, replacing the
use of GFP_ATOMIC + GFP_KERNEL.

Also, drop the unnecessary gfp_t parameter to isert_allocate_cmd(),
and just pass TASK_INTERRUPTIBLE into iscsit_allocate_cmd().

Reported-by: Linus Torvalds <torv...@linux-foundation.org>
Signed-off-by: Nicholas Bellinger <n...@linux-iscsi.org>
---
drivers/infiniband/ulp/isert/ib_isert.c | 14 +++++++-------
drivers/target/iscsi/iscsi_target.c | 14 +++++++-------
drivers/target/iscsi/iscsi_target_util.c | 7 +++----
drivers/target/iscsi/iscsi_target_util.h | 2 +-
include/target/iscsi/iscsi_transport.h | 2 +-
5 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index 421182b..63bcf69 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -1028,13 +1028,13 @@ isert_rx_login_req(struct iser_rx_desc *rx_desc, int rx_buflen,
}

static struct iscsi_cmd
-*isert_allocate_cmd(struct iscsi_conn *conn, gfp_t gfp)
+*isert_allocate_cmd(struct iscsi_conn *conn)
{
struct isert_conn *isert_conn = (struct isert_conn *)conn->context;
struct isert_cmd *isert_cmd;
struct iscsi_cmd *cmd;

- cmd = iscsit_allocate_cmd(conn, gfp);
+ cmd = iscsit_allocate_cmd(conn, TASK_INTERRUPTIBLE);
if (!cmd) {
pr_err("Unable to allocate iscsi_cmd + isert_cmd\n");
return NULL;
@@ -1223,7 +1223,7 @@ isert_rx_opcode(struct isert_conn *isert_conn, struct iser_rx_desc *rx_desc,

switch (opcode) {
case ISCSI_OP_SCSI_CMD:
- cmd = isert_allocate_cmd(conn, GFP_KERNEL);
+ cmd = isert_allocate_cmd(conn);
if (!cmd)
break;

@@ -1237,7 +1237,7 @@ isert_rx_opcode(struct isert_conn *isert_conn, struct iser_rx_desc *rx_desc,
rx_desc, (unsigned char *)hdr);
break;
case ISCSI_OP_NOOP_OUT:
- cmd = isert_allocate_cmd(conn, GFP_KERNEL);
+ cmd = isert_allocate_cmd(conn);
if (!cmd)
break;

@@ -1250,7 +1250,7 @@ isert_rx_opcode(struct isert_conn *isert_conn, struct iser_rx_desc *rx_desc,
(unsigned char *)hdr);
break;
case ISCSI_OP_SCSI_TMFUNC:
- cmd = isert_allocate_cmd(conn, GFP_KERNEL);
+ cmd = isert_allocate_cmd(conn);
if (!cmd)
break;

@@ -1258,7 +1258,7 @@ isert_rx_opcode(struct isert_conn *isert_conn, struct iser_rx_desc *rx_desc,
(unsigned char *)hdr);
break;
case ISCSI_OP_LOGOUT:
- cmd = isert_allocate_cmd(conn, GFP_KERNEL);
+ cmd = isert_allocate_cmd(conn);
if (!cmd)
break;

@@ -1269,7 +1269,7 @@ isert_rx_opcode(struct isert_conn *isert_conn, struct iser_rx_desc *rx_desc,
HZ);
break;
case ISCSI_OP_TEXT:
- cmd = isert_allocate_cmd(conn, GFP_KERNEL);
+ cmd = isert_allocate_cmd(conn);
if (!cmd)
break;

diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
index d70e911..2a52752 100644
--- a/drivers/target/iscsi/iscsi_target.c
+++ b/drivers/target/iscsi/iscsi_target.c
@@ -621,7 +621,7 @@ static int iscsit_add_reject(
{
struct iscsi_cmd *cmd;

- cmd = iscsit_allocate_cmd(conn, GFP_KERNEL);
+ cmd = iscsit_allocate_cmd(conn, TASK_INTERRUPTIBLE);
if (!cmd)
return -1;

@@ -2476,7 +2476,7 @@ static void iscsit_build_conn_drop_async_message(struct iscsi_conn *conn)
if (!conn_p)
return;

- cmd = iscsit_allocate_cmd(conn_p, GFP_ATOMIC);
+ cmd = iscsit_allocate_cmd(conn_p, TASK_RUNNING);
if (!cmd) {
iscsit_dec_conn_usage_count(conn_p);
return;
@@ -3952,7 +3952,7 @@ static int iscsi_target_rx_opcode(struct iscsi_conn *conn, unsigned char *buf)

switch (hdr->opcode & ISCSI_OPCODE_MASK) {
case ISCSI_OP_SCSI_CMD:
- cmd = iscsit_allocate_cmd(conn, GFP_KERNEL);
+ cmd = iscsit_allocate_cmd(conn, TASK_INTERRUPTIBLE);
if (!cmd)
goto reject;

@@ -3964,28 +3964,28 @@ static int iscsi_target_rx_opcode(struct iscsi_conn *conn, unsigned char *buf)
case ISCSI_OP_NOOP_OUT:
cmd = NULL;
if (hdr->ttt == cpu_to_be32(0xFFFFFFFF)) {
- cmd = iscsit_allocate_cmd(conn, GFP_KERNEL);
+ cmd = iscsit_allocate_cmd(conn, TASK_INTERRUPTIBLE);
if (!cmd)
goto reject;
}
ret = iscsit_handle_nop_out(conn, cmd, buf);
break;
case ISCSI_OP_SCSI_TMFUNC:
- cmd = iscsit_allocate_cmd(conn, GFP_KERNEL);
+ cmd = iscsit_allocate_cmd(conn, TASK_INTERRUPTIBLE);
if (!cmd)
goto reject;

ret = iscsit_handle_task_mgt_cmd(conn, cmd, buf);
break;
case ISCSI_OP_TEXT:
- cmd = iscsit_allocate_cmd(conn, GFP_KERNEL);
+ cmd = iscsit_allocate_cmd(conn, TASK_INTERRUPTIBLE);
if (!cmd)
goto reject;

ret = iscsit_handle_text_cmd(conn, cmd, buf);
break;
case ISCSI_OP_LOGOUT:
- cmd = iscsit_allocate_cmd(conn, GFP_KERNEL);
+ cmd = iscsit_allocate_cmd(conn, TASK_INTERRUPTIBLE);
if (!cmd)
goto reject;

diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 5477eca..e655b04 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -152,12 +152,11 @@ void iscsit_free_r2ts_from_list(struct iscsi_cmd *cmd)
* May be called from software interrupt (timer) context for allocating
* iSCSI NopINs.
*/
-struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *conn, gfp_t gfp_mask)
+struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *conn, int state)
{
struct iscsi_cmd *cmd;
struct se_session *se_sess = conn->sess->se_sess;
- int size, tag, state = (gfp_mask & __GFP_WAIT) ? TASK_INTERRUPTIBLE :
- TASK_RUNNING;
+ int size, tag;

tag = percpu_ida_alloc(&se_sess->sess_tag_pool, state);
if (tag < 0)
@@ -930,7 +929,7 @@ static int iscsit_add_nopin(struct iscsi_conn *conn, int want_response)
u8 state;
struct iscsi_cmd *cmd;

- cmd = iscsit_allocate_cmd(conn, GFP_ATOMIC);
+ cmd = iscsit_allocate_cmd(conn, TASK_RUNNING);
if (!cmd)
return -1;

diff --git a/drivers/target/iscsi/iscsi_target_util.h b/drivers/target/iscsi/iscsi_target_util.h
index e4fc34a..561a424 100644
--- a/drivers/target/iscsi/iscsi_target_util.h
+++ b/drivers/target/iscsi/iscsi_target_util.h
@@ -9,7 +9,7 @@ extern struct iscsi_r2t *iscsit_get_r2t_from_list(struct iscsi_cmd *);
extern void iscsit_free_r2t(struct iscsi_r2t *, struct iscsi_cmd *);
extern void iscsit_free_r2ts_from_list(struct iscsi_cmd *);
extern struct iscsi_cmd *iscsit_alloc_cmd(struct iscsi_conn *, gfp_t);
-extern struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *, gfp_t);
+extern struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *, int);
extern struct iscsi_seq *iscsit_get_seq_holder_for_datain(struct iscsi_cmd *, u32);
extern struct iscsi_seq *iscsit_get_seq_holder_for_r2t(struct iscsi_cmd *);
extern struct iscsi_r2t *iscsit_get_holder_for_r2tsn(struct iscsi_cmd *, u32);
diff --git a/include/target/iscsi/iscsi_transport.h b/include/target/iscsi/iscsi_transport.h
index a12589c..ae5a171 100644
--- a/include/target/iscsi/iscsi_transport.h
+++ b/include/target/iscsi/iscsi_transport.h
@@ -94,7 +94,7 @@ extern int iscsit_tmr_post_handler(struct iscsi_cmd *, struct iscsi_conn *);
/*
* From iscsi_target_util.c
*/
-extern struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *, gfp_t);
+extern struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *, int);
extern int iscsit_sequence_cmd(struct iscsi_conn *, struct iscsi_cmd *,
unsigned char *, __be32);
extern void iscsit_release_cmd(struct iscsi_cmd *);

Linus Torvalds

Jan 19, 2014, 9:40:02 PM
On Sun, Jan 19, 2014 at 2:16 AM, Nicholas A. Bellinger
<n...@linux-iscsi.org> wrote:
>
> This patch changes percpu_ida_alloc() to accept task state bitmask
> for prepare_to_wait() to support interruptible sleep for callers
> that require it.

This patch-series is not bisectable. Afaik, the first patch will break
the build (or at least cause the end result to not actually work).

This kind of "split up one large patch into many small patches THAT
DON'T ACTUALLY WORK INDIVIDUALLY" model is pure and utter garbage.

So a big NAK on this series as being completely broken.

To fix it, I would suggest:

- make the first patch change all *existing* users (that only have
the atomic vs uninterruptible semantics) pass in either
TASK_UNINTERRUPTIBLE or TASK_RUNNING depending on whether they had
__GFP_WAIT or not.

So the first patch would not change *any* semantics or behavior, it
would only change the calling convention.

- do the cleanup patches to block/blk-mq-tag.c to not have those
"gfp" calling convention, and instead passing in the state natively

- add the TASK_INTERRUPTIBLE case last (which includes the new
"signal_pending_state()" logic in percpu_ida_alloc())

that way, all patches compile cleanly and should each work
individually, and they all do clearly just one thing. And the biggest
patch in the series (the first one) doesn't actually make any semantic
changes.
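
For illustration, the kind of purely mechanical call-site change the
first patch would then contain (sketch only, not the actual patch):

	/* Before: gfp-based calling convention */
	tag = percpu_ida_alloc(&tags->free_tags, gfp);

	/* After: identical behaviour, expressed as a task state */
	tag = percpu_ida_alloc(&tags->free_tags,
			       (gfp & __GFP_WAIT) ?
			       TASK_UNINTERRUPTIBLE : TASK_RUNNING);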

Linus

Nicholas A. Bellinger

Jan 19, 2014, 11:10:02 PM
From: Nicholas Bellinger <n...@linux-iscsi.org>

Hi Linus,

Here is the -v2 series for converting percpu_ida_alloc() + consumer
usage to accept the task state bitmask parameter, w/o the extra
legacy gfp_t wrapper.

As requested, the first patch contains only the parameter change
to percpu_ida_alloc() + existing consumers, and makes no semantic
or behavior change. This patch is CC'ed for stable, and will need
some light massaging to apply back to v3.12.y.

The second patch is a blk-mq cleanup to propagate the task state
bitmask usage up to the blk-mq vs. legacy split in blk_get_request().

The last patch fixes the original iscsi-target session reset bug
by passing TASK_INTERRUPTIBLE, and adds the signal_pending_state()
bit required in percpu_ida_alloc() code. This is also CC'ed for
v3.12.y.

CC'ing Ingo + Peter for TASK_RUNNING + prepare_to_wait() bit.

Thank you,

--nab

Kent Overstreet (1):
percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask

Nicholas Bellinger (2):
blk-mq: Convert gfp_t parameters to task state bitmask
iscsi-target: Fix connection reset hang with percpu_ida_alloc

block/blk-core.c | 4 +++-
block/blk-flush.c | 2 +-
block/blk-mq-tag.c | 14 +++++++-------
block/blk-mq-tag.h | 2 +-
block/blk-mq.c | 28 ++++++++++++++--------------
drivers/target/iscsi/iscsi_target_util.c | 8 ++++++--
drivers/target/tcm_fc/tfc_cmd.c | 2 +-
drivers/vhost/scsi.c | 2 +-
include/linux/blk-mq.h | 4 ++--
include/linux/percpu_ida.h | 3 ++-
lib/percpu_ida.c | 17 +++++++++++------
11 files changed, 49 insertions(+), 37 deletions(-)

--
1.7.10.4

Nicholas A. Bellinger

Jan 19, 2014, 11:10:02 PM
From: Nicholas Bellinger <n...@linux-iscsi.org>

This patch addresses a bug where connection reset would hang
indefinitely once percpu_ida_alloc() was starved for tags, because
it always assumed uninterruptible sleep mode.

So now make percpu_ida_alloc() check signal_pending_state() to make
interruptible sleep optional, and convert iscsit_allocate_cmd() to
pass TASK_INTERRUPTIBLE for GFP_KERNEL, or TASK_RUNNING for
GFP_ATOMIC.
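
A condensed view of the resulting caller behaviour (the code touched
by the hunks below), for clarity:

	/*
	 * With TASK_INTERRUPTIBLE, a pending signal now pops the thread
	 * out of percpu_ida_alloc() (e.g. with -ERESTARTSYS) instead of
	 * leaving it stuck in uninterruptible sleep.
	 */
	tag = percpu_ida_alloc(&se_sess->sess_tag_pool, TASK_INTERRUPTIBLE);
	if (tag < 0)
		return NULL;	/* connection reset can now make progress */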

Reported-by: Linus Torvalds <torv...@linux-foundation.org>
Cc: Kent Overstreet <k...@daterainc.com>
Cc: <sta...@vger.kernel.org> #3.12+
Signed-off-by: Nicholas Bellinger <n...@linux-iscsi.org>
---
drivers/target/iscsi/iscsi_target_util.c | 2 +-
lib/percpu_ida.c | 9 +++++++--
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 9b8e1db..5477eca 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -156,7 +156,7 @@ struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *conn, gfp_t gfp_mask)
{
struct iscsi_cmd *cmd;
struct se_session *se_sess = conn->sess->se_sess;
- int size, tag, state = (gfp_mask & __GFP_WAIT) ? TASK_UNINTERRUPTIBLE :
+ int size, tag, state = (gfp_mask & __GFP_WAIT) ? TASK_INTERRUPTIBLE :
TASK_RUNNING;

tag = percpu_ida_alloc(&se_sess->sess_tag_pool, state);
diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index a48ce2e..a667110 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -138,14 +138,14 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
* tag_pool_init()), or otherwise -ENOSPC on allocation failure.
*
* Safe to be called from interrupt context (assuming it isn't passed
- * TASK_UNINTERRUPTIBLE, of course).
+ * TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, of course).
*
* @gfp indicates whether or not to wait until a free id is available (it's not
* used for internal memory allocations); thus if passed __GFP_WAIT we may sleep
* however long it takes until another thread frees an id (same semantics as a
* mempool).
*
- * Will not fail if passed TASK_UNINTERRUPTIBLE.
+ * Will not fail if passed TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE.
*/
int percpu_ida_alloc(struct percpu_ida *pool, int state)
{
@@ -194,6 +194,11 @@ int percpu_ida_alloc(struct percpu_ida *pool, int state)
if (tag >= 0 || state == TASK_RUNNING)
break;

+ if (signal_pending_state(state, current)) {
+ tag = -ERESTARTSYS;
+ break;
+ }
+
schedule();

local_irq_save(flags);

Nicholas A. Bellinger

Jan 19, 2014, 11:10:02 PM
From: Nicholas Bellinger <n...@linux-iscsi.org>

This patch propagates the use of a task state bitmask for
percpu_ida_alloc() up the blk-mq callchain, to the point in
blk_get_request() where the blk-mq vs. blk-old split occurs.

Along with the obvious parameter changes, there are two cases
in mq_flush_work() + blk_mq_make_request() where the original
code used __GFP_WAIT|GFP_ATOMIC and always expected a tag;
these have been converted to TASK_UNINTERRUPTIBLE.

Reported-by: Linus Torvalds <torv...@linux-foundation.org>
Cc: Jens Axboe <ax...@kernel.dk>
Signed-off-by: Nicholas Bellinger <n...@linux-iscsi.org>
---
block/blk-core.c | 4 +++-
block/blk-flush.c | 2 +-
block/blk-mq-tag.c | 16 +++++++---------
block/blk-mq-tag.h | 2 +-
block/blk-mq.c | 28 ++++++++++++++--------------
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 5d70edc..20777bd 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -32,19 +32,18 @@ bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
percpu_ida_free_tags(&tags->free_tags, nr_cpu_ids) != 0;
}

-static unsigned int __blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp)
+static unsigned int __blk_mq_get_tag(struct blk_mq_tags *tags, int state)
{
int tag;

- tag = percpu_ida_alloc(&tags->free_tags, (gfp & __GFP_WAIT) ?
- TASK_UNINTERRUPTIBLE : TASK_RUNNING);
+ tag = percpu_ida_alloc(&tags->free_tags, state);
if (tag < 0)
return BLK_MQ_TAG_FAIL;
return tag + tags->nr_reserved_tags;
}

static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_tags *tags,
- gfp_t gfp)
+ int state)
{
int tag;

@@ -53,19 +52,18 @@ static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_tags *tags,
return BLK_MQ_TAG_FAIL;
}

- tag = percpu_ida_alloc(&tags->reserved_tags, (gfp & __GFP_WAIT) ?
- TASK_UNINTERRUPTIBLE : TASK_RUNNING);
+ tag = percpu_ida_alloc(&tags->reserved_tags, state);
if (tag < 0)
return BLK_MQ_TAG_FAIL;

Nicholas A. Bellinger

Jan 19, 2014, 11:10:03 PM
From: Kent Overstreet <k...@daterainc.com>

This patch changes percpu_ida_alloc() + callers to accept task state
bitmask for prepare_to_wait() for code like target/iscsi that needs
it for interruptible sleep, that is provided in a subsequent patch.

It now expects TASK_UNINTERRUPTIBLE when the caller is able to sleep
waiting for a new tag, or TASK_RUNNING when the caller cannot sleep,
and is forced to return a negative value when no tags are available.

v2 changes:
- Include blk-mq + tcm_fc + vhost/scsi + target/iscsi changes
- Drop signal_pending_state() call

Reported-by: Linus Torvalds <torv...@linux-foundation.org>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Jens Axboe <ax...@kernel.dk>
Signed-off-by: Kent Overstreet <k...@daterainc.com>
Cc: <sta...@vger.kernel.org> #3.12+
Signed-off-by: Nicholas Bellinger <n...@linux-iscsi.org>
---
block/blk-mq-tag.c | 6 ++++--
drivers/target/iscsi/iscsi_target_util.c | 8 ++++++--
drivers/target/tcm_fc/tfc_cmd.c | 2 +-
drivers/vhost/scsi.c | 2 +-
include/linux/percpu_ida.h | 3 ++-
lib/percpu_ida.c | 12 ++++++------
6 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d64a02f..5d70edc 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -36,7 +36,8 @@ static unsigned int __blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp)
{
int tag;

- tag = percpu_ida_alloc(&tags->free_tags, gfp);
+ tag = percpu_ida_alloc(&tags->free_tags, (gfp & __GFP_WAIT) ?
+ TASK_UNINTERRUPTIBLE : TASK_RUNNING);
if (tag < 0)
return BLK_MQ_TAG_FAIL;
return tag + tags->nr_reserved_tags;
@@ -52,7 +53,8 @@ static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_tags *tags,
return BLK_MQ_TAG_FAIL;
}

- tag = percpu_ida_alloc(&tags->reserved_tags, gfp);
+ tag = percpu_ida_alloc(&tags->reserved_tags, (gfp & __GFP_WAIT) ?
+ TASK_UNINTERRUPTIBLE : TASK_RUNNING);
if (tag < 0)
return BLK_MQ_TAG_FAIL;
return tag;
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 0819e68..9b8e1db 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -156,9 +156,13 @@ struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *conn, gfp_t gfp_mask)
{
struct iscsi_cmd *cmd;
struct se_session *se_sess = conn->sess->se_sess;
- int size, tag;
+ int size, tag, state = (gfp_mask & __GFP_WAIT) ? TASK_UNINTERRUPTIBLE :
+ TASK_RUNNING;
+
+ tag = percpu_ida_alloc(&se_sess->sess_tag_pool, state);
+ if (tag < 0)
+ return NULL;

- tag = percpu_ida_alloc(&se_sess->sess_tag_pool, gfp_mask);
size = sizeof(struct iscsi_cmd) + conn->conn_transport->priv_size;
cmd = (struct iscsi_cmd *)(se_sess->sess_cmd_map + (tag * size));
memset(cmd, 0, size);
diff --git a/drivers/target/tcm_fc/tfc_cmd.c b/drivers/target/tcm_fc/tfc_cmd.c
index 479ec56..8b2c1aa 100644
--- a/drivers/target/tcm_fc/tfc_cmd.c
+++ b/drivers/target/tcm_fc/tfc_cmd.c
@@ -438,7 +438,7 @@ static void ft_recv_cmd(struct ft_sess *sess, struct fc_frame *fp)
struct se_session *se_sess = sess->se_sess;
int tag;

- tag = percpu_ida_alloc(&se_sess->sess_tag_pool, GFP_ATOMIC);
+ tag = percpu_ida_alloc(&se_sess->sess_tag_pool, TASK_RUNNING);
if (tag < 0)
goto busy;

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 84488a8..2d084fb 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -728,7 +728,7 @@ vhost_scsi_get_tag(struct vhost_virtqueue *vq,
}
se_sess = tv_nexus->tvn_se_sess;

- tag = percpu_ida_alloc(&se_sess->sess_tag_pool, GFP_ATOMIC);
+ tag = percpu_ida_alloc(&se_sess->sess_tag_pool, TASK_RUNNING);
if (tag < 0) {
pr_err("Unable to obtain tag for tcm_vhost_cmd\n");
return ERR_PTR(-ENOMEM);
diff --git a/include/linux/percpu_ida.h b/include/linux/percpu_ida.h
index 1900bd0..f5cfdd6 100644
--- a/include/linux/percpu_ida.h
+++ b/include/linux/percpu_ida.h
@@ -4,6 +4,7 @@
#include <linux/types.h>
#include <linux/bitops.h>
#include <linux/init.h>
+#include <linux/sched.h>
#include <linux/spinlock_types.h>
#include <linux/wait.h>
#include <linux/cpumask.h>
@@ -61,7 +62,7 @@ struct percpu_ida {
/* Max size of percpu freelist, */
#define IDA_DEFAULT_PCPU_SIZE ((IDA_DEFAULT_PCPU_BATCH_MOVE * 3) / 2)

-int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp);
+int percpu_ida_alloc(struct percpu_ida *pool, int state);
void percpu_ida_free(struct percpu_ida *pool, unsigned tag);

void percpu_ida_destroy(struct percpu_ida *pool);
diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index 9d054bf..a48ce2e 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -132,22 +132,22 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
/**
* percpu_ida_alloc - allocate a tag
* @pool: pool to allocate from
- * @gfp: gfp flags
+ * @state: task state for prepare_to_wait
*
* Returns a tag - an integer in the range [0..nr_tags) (passed to
* tag_pool_init()), or otherwise -ENOSPC on allocation failure.
*
* Safe to be called from interrupt context (assuming it isn't passed
- * __GFP_WAIT, of course).
+ * TASK_UNINTERRUPTIBLE, of course).
*
* @gfp indicates whether or not to wait until a free id is available (it's not
* used for internal memory allocations); thus if passed __GFP_WAIT we may sleep
* however long it takes until another thread frees an id (same semantics as a
* mempool).
*
- * Will not fail if passed __GFP_WAIT.
+ * Will not fail if passed TASK_UNINTERRUPTIBLE.
*/
-int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
+int percpu_ida_alloc(struct percpu_ida *pool, int state)
{
DEFINE_WAIT(wait);
struct percpu_ida_cpu *tags;
@@ -174,7 +174,7 @@ int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
*
* global lock held and irqs disabled, don't need percpu lock
*/
- prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
+ prepare_to_wait(&pool->wait, &wait, state);

if (!tags->nr_free)
alloc_global_tags(pool, tags);
@@ -191,7 +191,7 @@ int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
spin_unlock(&pool->lock);
local_irq_restore(flags);

- if (tag >= 0 || !(gfp & __GFP_WAIT))
+ if (tag >= 0 || state == TASK_RUNNING)
break;

schedule();

Nicholas A. Bellinger

Jan 19, 2014, 11:20:02 PM
On Sun, 2014-01-19 at 18:38 -0800, Linus Torvalds wrote:
> On Sun, Jan 19, 2014 at 2:16 AM, Nicholas A. Bellinger
> <n...@linux-iscsi.org> wrote:
> >
> > This patch changes percpu_ida_alloc() to accept task state bitmask
> > for prepare_to_wait() to support interruptible sleep for callers
> > that require it.
>
> This patch-series is not bisectable. Afaik, the first patch will break
> the build (or at least cause the end result to not actually work).
>
> This kind of "split up one large patch into many small patches THAT
> DON'T ACTUALLY WORK INDIVIDUALLY" model is pure and utter garbage.
>
> So a big NAK on this series as being completely broken.
>

So the late night reasoning was to allow the patches to apply cleanly to
stable. That was, indeed, a bad decision.

> To fix it, I would suggest:
>
> - make the first patch change all *existing* users (that only have
> the atomic vs uninterruptible semantics) pass in either
> TASK_UNINTERRUPTIBLE or TASK_RUNNING depending on whether they had
> __GFP_WAIT or not.
>
> So the first patch would not change *any* semantics or behavior, it
> would only change the calling convention.
>
> - do the cleanup patches to block/blk-mq-tag.c to not have those
> "gfp" calling convention, and instead passing in the state natively
>
> - add the TASK_INTERRUPTIBLE case last (which includes the new
> "signal_pending_state()" logic in percpu_ida_alloc())
>
> that way, all patches compile cleanly and should each work
> individually, and they all do clearly just one thing. And the biggest
> patch in the series (the first one) doesn't actually make any semantic
> changes.

-v2 sent out. Please review.

Jens, please review the blk-mq specific changes in patch #1, and let me
know if you'd like to pick-up #2 via the block tree, or have it included
in target-pending/for-next.

Thanks,

--nab

Peter Zijlstra

Jan 20, 2014, 6:40:02 AM
On Mon, Jan 20, 2014 at 03:44:44AM +0000, Nicholas A. Bellinger wrote:
> From: Kent Overstreet <k...@daterainc.com>
>
> This patch changes percpu_ida_alloc() + callers to accept task state
> bitmask for prepare_to_wait() for code like target/iscsi that needs
> it for interruptible sleep, that is provided in a subsequent patch.
>
> It now expects TASK_UNINTERRUPTIBLE when the caller is able to sleep
> waiting for a new tag, or TASK_RUNNING when the caller cannot sleep,
> and is forced to return a negative value when no tags are available.
>
> v2 changes:
> - Include blk-mq + tcm_fc + vhost/scsi + target/iscsi changes
> - Drop signal_pending_state() call

Urgh, you made me look at percpu_ida... steal_tags() does a
for_each_cpus() with IRQs disabled. This means you'll disable IRQs for
multiple ticks on SGI class hardware. That is a _very_ long time indeed.

Then there's alloc_global_tags() vs alloc_local_tags(), one gets an
actual tag, while the other only moves tags about -- semantic mismatch.

I do not get the comment near prepare to wait -- why does it matter if
percpu_ida_free() flips a cpus_have_tags bit?

Given I don't understand this comment, its hard for me to properly
review the proposed patch series.

Help?

Nicholas A. Bellinger

Jan 21, 2014, 5:10:02 PM
On Mon, 2014-01-20 at 12:34 +0100, Peter Zijlstra wrote:
> On Mon, Jan 20, 2014 at 03:44:44AM +0000, Nicholas A. Bellinger wrote:
> > From: Kent Overstreet <k...@daterainc.com>
> >
> > This patch changes percpu_ida_alloc() + callers to accept task state
> > bitmask for prepare_to_wait() for code like target/iscsi that needs
> > it for interruptible sleep, that is provided in a subsequent patch.
> >
> > It now expects TASK_UNINTERRUPTIBLE when the caller is able to sleep
> > waiting for a new tag, or TASK_RUNNING when the caller cannot sleep,
> > and is forced to return a negative value when no tags are available.
> >
> > v2 changes:
> > - Include blk-mq + tcm_fc + vhost/scsi + target/iscsi changes
> > - Drop signal_pending_state() call
>
> Urgh, you made me look at percpu_ida... steal_tags() does a
> for_each_cpus() with IRQs disabled. This mean you'll disable IRQs for
> multiple ticks on SGI class hardware. That is a _very_ long time indeed.
>

So given the performance penalties involved in the steal tag slow path,
consumers should typically be pre-allocating a larger number of
percpu_ida tags than necessary to (ideally) avoid this logic completely.
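
As a rough sketch of that (SESSION_TAG_DEPTH here is a made-up,
driver-specific queue depth, not an existing constant):

static int example_init_session_tags(struct se_session *se_sess)
{
	/*
	 * Size the pool generously at init time so the per-cpu freelists
	 * rarely run dry and the steal_tags() slow path is rarely entered.
	 */
	return percpu_ida_init(&se_sess->sess_tag_pool,
			       SESSION_TAG_DEPTH * num_possible_cpus());
}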

> Then there's alloc_global_tags() vs alloc_local_tags(), one gets an
> actual tag, while the other only moves tags about -- semantic mismatch.
>

How about just in-lining alloc_global_tags() into percpu_ida_alloc()..?

> I do not get the comment near prepare to wait -- why does it matter if
> percpu_ida_free() flips a cpus_have_tags bit?
>

Mmm, not sure on that one.

> Given I don't understand this comment, its hard for me to properly
> review the proposed patch series.
>

Kent..?

--nab

Kent Overstreet

Jan 21, 2014, 5:20:03 PM
On Mon, Jan 20, 2014 at 12:34:15PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 20, 2014 at 03:44:44AM +0000, Nicholas A. Bellinger wrote:
> > From: Kent Overstreet <k...@daterainc.com>
> >
> > This patch changes percpu_ida_alloc() + callers to accept task state
> > bitmask for prepare_to_wait() for code like target/iscsi that needs
> > it for interruptible sleep, that is provided in a subsequent patch.
> >
> > It now expects TASK_UNINTERRUPTIBLE when the caller is able to sleep
> > waiting for a new tag, or TASK_RUNNING when the caller cannot sleep,
> > and is forced to return a negative value when no tags are available.
> >
> > v2 changes:
> > - Include blk-mq + tcm_fc + vhost/scsi + target/iscsi changes
> > - Drop signal_pending_state() call
>
> Urgh, you made me look at percpu_ida... steal_tags() does a
> for_each_cpus() with IRQs disabled. This mean you'll disable IRQs for
> multiple ticks on SGI class hardware. That is a _very_ long time indeed.

It's not that bad in practice - the looping is limited by the number of other
CPUs that actually have tags on their freelists - i.e. the CPUs that have
recently been using that block device or whatever the percpu_ida is for. And we
loop while cpu_have_tags is greater than some threshold (there's another debate
about that) - the intention is not to steal tags unless too many other CPUs have
tags on their local freelists.

That said, for huge SGI class hardware I think you'd want the freelists to not
be percpu, but rather be per core or something - that's probably a reasonable
optimization for most hardware anyways.

> Then there's alloc_global_tags() vs alloc_local_tags(), one gets an
> actual tag, while the other only moves tags about -- semantic mismatch.

Yeah, kind of. It is doing allocation, but not the same sort of allocation.

> I do not get the comment near prepare to wait -- why does it matter if
> percpu_ida_free() flips a cpus_have_tags bit?

Did I write that comment? It is a crappy comment...

Ok, in userspace we'd be using condition variables here, but this is the kernel
so we need to carefully order putting ourselves on a waitlist, and checking the
condition that determines whether we wait, and on the wakeup end changing things
that affect that condition and doing the wakeup. steal_tags() is checking the
condition that goes with the prepare_to_wait(), that's all.
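
For reference, a generic sketch of the ordering being described (not
the exact percpu_ida code; have_free_tags() is a made-up stand-in for
the real condition):

	DEFINE_WAIT(wait);

	for (;;) {
		/* Queue ourselves *before* re-checking the condition... */
		prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
		if (have_free_tags(pool))
			/* ...so a wake-up racing with the check is not lost. */
			break;
		schedule();
	}
	finish_wait(&pool->wait, &wait);

	/* Freeing side: publish the freed tag first, then wake_up(&pool->wait). */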

Nicholas A. Bellinger

Jan 22, 2014, 3:00:02 PM
Hi Peter,

Does this satisfy your questions..?

Do you have any more concerns about TASK_RUNNING + prepare_to_wait()
usage in percpu_ida_alloc() that need to be addressed before I can drop
this series into target-pending/for-next to address the original bug..?

Thank you,

--nab

Nicholas A. Bellinger

Jan 22, 2014, 3:00:03 PM
Hey Jens,

On Mon, 2014-01-20 at 03:44 +0000, Nicholas A. Bellinger wrote:
> From: Nicholas Bellinger <n...@linux-iscsi.org>
>
> Hi Linus,
>
> Here is the -v2 series for converting percpu_ida_alloc() + consumer
> usage to accept the task state bitmask parameter, w/o the extra
> legacy gfp_t wrapper.
>
> As requested, the first patch contains only the parameter change
> to percpu_ida_alloc() + existing consumers, and makes no semantic
> or behavior change. This patch is CC'ed for stable, and will need
> some light massaging to apply back to v3.12.y.
>

Any objections to the blk-mq changes in patch #1..? Please ACK. :)

> The second patch is a blk-mq cleanup to propagate the task state
> bitmask usage up to the blk-mq vs. legacy split in blk_get_request().
>

How about this one..? Do you want to pick it up for v3.15, or have it
included in target-pending/for-next now..?

Thanks,

--nab

Peter Zijlstra

Jan 23, 2014, 7:50:02 AM
On Tue, Jan 21, 2014 at 02:18:52PM -0800, Kent Overstreet wrote:
> > I do not get the comment near prepare to wait -- why does it matter if
> > percpu_ida_free() flips a cpus_have_tags bit?
>
> Did I write that comment? It is a crappy comment...
>
> Ok, in userspace we'd be using condition variables here, but this is the kernel
> so we need to carefully order putting ourselves on a waitlist, and checking the
> condition that determines whether we wait, and on the wakeup end changing things
> that affect that condition and doing the wakeup. steal_tags() is checking the
> condition that goes with the prepare_to_wait(), that's all.

How about something like so?

---
lib/percpu_ida.c | 126 ++++++++++++++++++++++++++++++-------------------------
1 file changed, 70 insertions(+), 56 deletions(-)

diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index 9d054bf91d0f..fc13d70b5b5e 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -68,6 +68,8 @@ static inline void steal_tags(struct percpu_ida *pool,
unsigned cpus_have_tags, cpu = pool->cpu_last_stolen;
struct percpu_ida_cpu *remote;

+ lockdep_assert_held(&pool->lock);
+
for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
cpus_have_tags * pool->percpu_max_size > pool->nr_tags / 2;
cpus_have_tags--) {
@@ -105,19 +107,7 @@ static inline void steal_tags(struct percpu_ida *pool,
}
}

-/*
- * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
- * our percpu freelist:
- */
-static inline void alloc_global_tags(struct percpu_ida *pool,
- struct percpu_ida_cpu *tags)
-{
- move_tags(tags->freelist, &tags->nr_free,
- pool->freelist, &pool->nr_free,
- min(pool->nr_free, pool->percpu_batch_size));
-}
-
-static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
+static inline int alloc_local_tag(struct percpu_ida_cpu *tags)
{
int tag = -ENOSPC;

@@ -129,6 +119,50 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
return tag;
}

+static inline int
+__alloc_global_tag(struct percpu_ida *pool, struct percpu_ida_cpu *tags)
+{
+ int tag = -ENOSPC;
+
+ spin_lock(&pool->lock);
+
+ if (!tags->nr_free) {
+ /*
+ * Move tags from the global-, onto our percpu-freelist.
+ */
+ move_tags(tags->freelist, &tags->nr_free,
+ pool->freelist, &pool->nr_free,
+ min(pool->nr_free, pool->percpu_batch_size));
+ }
+
+ if (!tags->nr_free)
+ steal_tags(pool, tags);
+
+ if (tags->nr_free) {
+ tag = tags->freelist[--tags->nr_free];
+ if (tags->nr_free) {
+ cpumask_set_cpu(smp_processor_id(),
+ &pool->cpus_have_tags);
+ }
+ }
+
+ spin_unlock(&pool->lock);
+
+ return tag;
+}
+
+static inline int alloc_global_tag(struct percpu_ida *pool)
+{
+ unsigned long flags;
+ int tag;
+
+ local_irq_save(flags);
+ tag = __alloc_global_tag(pool, this_cpu_ptr(pool->tag_cpu));
+ local_irq_restore(flags);
+
+ return tag;
+}
+
/**
* percpu_ida_alloc - allocate a tag
* @pool: pool to allocate from
@@ -147,61 +181,34 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
*
* Will not fail if passed __GFP_WAIT.
*/
-int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
+int percpu_ida_alloc(struct percpu_ida *pool, int state)
{
- DEFINE_WAIT(wait);
struct percpu_ida_cpu *tags;
unsigned long flags;
- int tag;
+ int tag, ret = -ENOSPC;

local_irq_save(flags);
tags = this_cpu_ptr(pool->tag_cpu);

- /* Fastpath */
tag = alloc_local_tag(tags);
- if (likely(tag >= 0)) {
- local_irq_restore(flags);
- return tag;
- }
+ if (unlikely(tag < 0))
+ tag = __alloc_global_tag(pool, tags);

- while (1) {
- spin_lock(&pool->lock);
-
- /*
- * prepare_to_wait() must come before steal_tags(), in case
- * percpu_ida_free() on another cpu flips a bit in
- * cpus_have_tags
- *
- * global lock held and irqs disabled, don't need percpu lock
- */
- prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
-
- if (!tags->nr_free)
- alloc_global_tags(pool, tags);
- if (!tags->nr_free)
- steal_tags(pool, tags);
-
- if (tags->nr_free) {
- tag = tags->freelist[--tags->nr_free];
- if (tags->nr_free)
- cpumask_set_cpu(smp_processor_id(),
- &pool->cpus_have_tags);
- }
-
- spin_unlock(&pool->lock);
- local_irq_restore(flags);
+ local_irq_restore(flags);

- if (tag >= 0 || !(gfp & __GFP_WAIT))
- break;
+ if (likely(tag >= 0))
+ return tag;

- schedule();
+ if (state != TASK_RUNNING) {
+ ret = ___wait_event(pool->wait,
+ (tag = alloc_global_tag(pool)) >= 0,
+ state, 0, 0, schedule());

- local_irq_save(flags);
- tags = this_cpu_ptr(pool->tag_cpu);
+ if (tag >= 0)
+ ret = tag;
}

- finish_wait(&pool->wait, &wait);
- return tag;
+ return ret;
}
EXPORT_SYMBOL_GPL(percpu_ida_alloc);

@@ -235,12 +242,19 @@ void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
wake_up(&pool->wait);
}

+ /*
+ * No point in waking when 1 < n < max_size free, because steal_tags()
+ * will assume max_size per set bit, having more free will not make it
+ * more or less likely it will visit this cpu.
+ */
+
if (nr_free == pool->percpu_max_size) {
spin_lock(&pool->lock);

/*
- * Global lock held and irqs disabled, don't need percpu
- * lock
+ * Global lock held and irqs disabled, don't need percpu lock
+ * because everybody accessing remote @tags will hold
+ * pool->lock -- steal_tags().
*/
if (tags->nr_free == pool->percpu_max_size) {
move_tags(pool->freelist, &pool->nr_free,

Kent Overstreet

Jan 23, 2014, 8:30:02 AM
On Thu, Jan 23, 2014 at 01:47:53PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 02:18:52PM -0800, Kent Overstreet wrote:
> > > I do not get the comment near prepare to wait -- why does it matter if
> > > percpu_ida_free() flips a cpus_have_tags bit?
> >
> > Did I write that comment? It is a crappy comment...
> >
> > Ok, in userspace we'd be using condition variables here, but this is the kernel
> > so we need to carefully order putting ourselves on a waitlist, and checking the
> > condition that determines whether we wait, and on the wakeup end changing things
> > that affect that condition and doing the wakeup. steal_tags() is checking the
> > condition that goes with the prepare_to_wait(), that's all.
>
> How about something like so?

I like it - my only concern is that your patch has the effect of calling
__alloc_global_tag() twice before sleeping on alloc failure - given that
we're also doing a prepare_to_wait() I'm not concerned about touching
the global freelist twice, but we're also calling steal_tags() twice and
that's potentially more expensive.

It should be ok, because I expect when steal_tags() is going to fail
most of the time it'll check the bitmap and not run the loop, but I
think there's enough room for pathological behaviour here to sleep on
it.

pool->lock is also going to be fairly badly contended in the worst case,
and that can get real bad real fast... now that I think about it we
probably want to avoid the __alloc_global_tag() double call just because
of that, pool->lock is going to be quite a bit more contended than the
waitlist lock just because fo the amount of work done under it.

though my old code was also calling prepare_to_wait() with pool->lock
held, which was dumb.

Kent Overstreet

Jan 23, 2014, 9:00:02 AM
On Thu, Jan 23, 2014 at 02:50:03PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 05:28:29AM -0800, Kent Overstreet wrote:
> > pool->lock is also going to be fairly badly contended in the worst case,
> > and that can get real bad real fast... now that I think about it we
> > probably want to avoid the __alloc_global_tag() double call just because
> > of that, pool->lock is going to be quite a bit more contended than the
> > waitlist lock just because fo the amount of work done under it.
>
> OK, how about this then.. Not quite at pretty though

Heh, I suppose that is a solution. Let me screw around to see what I can
come up with tomorrow, but if I can't come up with anything I like I'm
not opposed to this.

> ---
> lib/percpu_ida.c | 128 ++++++++++++++++++++++++++++---------------------------
> 1 file changed, 65 insertions(+), 63 deletions(-)
>
> diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
> index 9d054bf91d0f..a1279622436a 100644
> --- a/lib/percpu_ida.c
> +++ b/lib/percpu_ida.c
> @@ -68,6 +68,8 @@ static inline void steal_tags(struct percpu_ida *pool,
> unsigned cpus_have_tags, cpu = pool->cpu_last_stolen;
> struct percpu_ida_cpu *remote;
>
> + lockdep_assert_held(&pool->lock);
> +
> for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
> cpus_have_tags * pool->percpu_max_size > pool->nr_tags / 2;
> cpus_have_tags--) {
> @@ -105,26 +107,52 @@ static inline void steal_tags(struct percpu_ida *pool,
> }
> }
>
> -/*
> - * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
> - * our percpu freelist:
> - */
> -static inline void alloc_global_tags(struct percpu_ida *pool,
> - struct percpu_ida_cpu *tags)
> +static inline int alloc_local_tag(struct percpu_ida *pool)
> {
> - move_tags(tags->freelist, &tags->nr_free,
> - pool->freelist, &pool->nr_free,
> - min(pool->nr_free, pool->percpu_batch_size));
> + struct percpu_ida_cpu *tags;
> + unsigned long flags;
> + int tag = -ENOSPC;
> +
> + local_irq_save(flags);
> + tags = this_cpu_ptr(pool->tag_cpu);
> + spin_lock(&tags->lock);
> + if (tags->nr_free)
> + tag = tags->freelist[--tags->nr_free];
> + spin_unlock_irqrestore(&tags->lock, flags);
> +
> + return tag;
> }
>
> -static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
> +static inline int alloc_global_tag(struct percpu_ida *pool)
> {
> + struct percpu_ida_cpu *tags;
> + unsigned long flags;
> int tag = -ENOSPC;
>
> - spin_lock(&tags->lock);
> - if (tags->nr_free)
> + spin_lock_irqsave(&pool->lock, flags);
> + tags = this_cpu_ptr(pool->tag_cpu);
> +
> + if (!tags->nr_free) {
> + /*
> + * Move tags from the global-, onto our percpu-freelist.
> + */
> + move_tags(tags->freelist, &tags->nr_free,
> + pool->freelist, &pool->nr_free,
> + min(pool->nr_free, pool->percpu_batch_size));
> + }
> +
> + if (!tags->nr_free)
> + steal_tags(pool, tags);
> +
> + if (tags->nr_free) {
> tag = tags->freelist[--tags->nr_free];
> - spin_unlock(&tags->lock);
> + if (tags->nr_free) {
> + cpumask_set_cpu(smp_processor_id(),
> + &pool->cpus_have_tags);
> + }
> + }
> +
> + spin_unlock_irqrestore(&pool->lock, flags);
>
> return tag;
> }
> @@ -147,61 +175,28 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
> *
> * Will not fail if passed __GFP_WAIT.
> */
> -int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
> +int percpu_ida_alloc(struct percpu_ida *pool, int state)
> {
> - DEFINE_WAIT(wait);
> - struct percpu_ida_cpu *tags;
> - unsigned long flags;
> - int tag;
> -
> - local_irq_save(flags);
> - tags = this_cpu_ptr(pool->tag_cpu);
> -
> - /* Fastpath */
> - tag = alloc_local_tag(tags);
> - if (likely(tag >= 0)) {
> - local_irq_restore(flags);
> - return tag;
> - }
> -
> - while (1) {
> - spin_lock(&pool->lock);
> + int ret;
>
> - /*
> - * prepare_to_wait() must come before steal_tags(), in case
> - * percpu_ida_free() on another cpu flips a bit in
> - * cpus_have_tags
> - *
> - * global lock held and irqs disabled, don't need percpu lock
> - */
> - prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
> + ret = alloc_local_tag(pool);
> + if (likely(ret >= 0))
> + return ret;
>
> - if (!tags->nr_free)
> - alloc_global_tags(pool, tags);
> - if (!tags->nr_free)
> - steal_tags(pool, tags);
> + if (state != TASK_RUNNING) {
> + int tag;
>
> - if (tags->nr_free) {
> - tag = tags->freelist[--tags->nr_free];
> - if (tags->nr_free)
> - cpumask_set_cpu(smp_processor_id(),
> - &pool->cpus_have_tags);
> - }
> + ret = ___wait_event(pool->wait,
> + (tag = alloc_global_tag(pool)) >= 0,
> + state, 0, 0, schedule());
>
> - spin_unlock(&pool->lock);
> - local_irq_restore(flags);
> -
> - if (tag >= 0 || !(gfp & __GFP_WAIT))
> - break;
> -
> - schedule();
> -
> - local_irq_save(flags);
> - tags = this_cpu_ptr(pool->tag_cpu);
> + if (tag >= 0)
> + ret = tag;
> + } else {
> + ret = alloc_global_tag(pool);
> }
>
> - finish_wait(&pool->wait, &wait);
> - return tag;
> + return ret;
> }
> EXPORT_SYMBOL_GPL(percpu_ida_alloc);
>
> @@ -235,12 +230,19 @@ void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
> wake_up(&pool->wait);
> }
>
> + /*
> + * No point in waking when 1 < n < max_size free, because steal_tags()
> + * will assume max_size per set bit, having more free will not make it
> + * more or less likely it will visit this cpu.
> + */
> +
> if (nr_free == pool->percpu_max_size) {
> spin_lock(&pool->lock);
>
> /*
> - * Global lock held and irqs disabled, don't need percpu
> - * lock
> + * Global lock held and irqs disabled, don't need percpu lock
> + * because everybody accessing remote @tags will hold
> + * pool->lock -- steal_tags().
> */
> if (tags->nr_free == pool->percpu_max_size) {
> move_tags(pool->freelist, &pool->nr_free,

Peter Zijlstra

unread,
Jan 23, 2014, 9:00:02 AM1/23/14
to
On Thu, Jan 23, 2014 at 05:28:29AM -0800, Kent Overstreet wrote:
> pool->lock is also going to be fairly badly contended in the worst case,
> and that can get real bad real fast... now that I think about it we
> probably want to avoid the __alloc_global_tag() double call just because
> of that, pool->lock is going to be quite a bit more contended than the
> waitlist lock just because of the amount of work done under it.

OK, how about this then.. Not quite as pretty though

---
lib/percpu_ida.c | 128 ++++++++++++++++++++++++++++---------------------------
1 file changed, 65 insertions(+), 63 deletions(-)

diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index 9d054bf91d0f..a1279622436a 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -68,6 +68,8 @@ static inline void steal_tags(struct percpu_ida *pool,
unsigned cpus_have_tags, cpu = pool->cpu_last_stolen;
struct percpu_ida_cpu *remote;

+ lockdep_assert_held(&pool->lock);
+
for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
cpus_have_tags * pool->percpu_max_size > pool->nr_tags / 2;
cpus_have_tags--) {
@@ -105,26 +107,52 @@ static inline void steal_tags(struct percpu_ida *pool,
}
}

-/*
- * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
- * our percpu freelist:
- */
-static inline void alloc_global_tags(struct percpu_ida *pool,
- struct percpu_ida_cpu *tags)
+static inline int alloc_local_tag(struct percpu_ida *pool)
{
- move_tags(tags->freelist, &tags->nr_free,
- pool->freelist, &pool->nr_free,
- min(pool->nr_free, pool->percpu_batch_size));
+ struct percpu_ida_cpu *tags;
+ unsigned long flags;
+ int tag = -ENOSPC;
+
+ local_irq_save(flags);
+ tags = this_cpu_ptr(pool->tag_cpu);
+ spin_lock(&tags->lock);
+ if (tags->nr_free)
+ tag = tags->freelist[--tags->nr_free];
+ spin_unlock_irqrestore(&tags->lock, flags);
+
+ return tag;
}

-static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
+static inline int alloc_global_tag(struct percpu_ida *pool)
{
+ struct percpu_ida_cpu *tags;
+ unsigned long flags;
int tag = -ENOSPC;

- spin_lock(&tags->lock);
- if (tags->nr_free)
+ spin_lock_irqsave(&pool->lock, flags);
+ tags = this_cpu_ptr(pool->tag_cpu);
+
+ if (!tags->nr_free) {
+ /*
+ * Move tags from the global-, onto our percpu-freelist.
+ */
+ move_tags(tags->freelist, &tags->nr_free,
+ pool->freelist, &pool->nr_free,
+ min(pool->nr_free, pool->percpu_batch_size));
+ }
+
+ if (!tags->nr_free)
+ steal_tags(pool, tags);
+
+ if (tags->nr_free) {
tag = tags->freelist[--tags->nr_free];
- spin_unlock(&tags->lock);
+ if (tags->nr_free) {
+ cpumask_set_cpu(smp_processor_id(),
+ &pool->cpus_have_tags);
+ }
+ }
+
+ spin_unlock_irqrestore(&pool->lock, flags);

return tag;
}
@@ -147,61 +175,28 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
*
* Will not fail if passed __GFP_WAIT.
*/
-int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
+int percpu_ida_alloc(struct percpu_ida *pool, int state)
{
- DEFINE_WAIT(wait);
- struct percpu_ida_cpu *tags;
- unsigned long flags;
- int tag;
-
- local_irq_save(flags);
- tags = this_cpu_ptr(pool->tag_cpu);
-
- /* Fastpath */
- tag = alloc_local_tag(tags);
- if (likely(tag >= 0)) {
- local_irq_restore(flags);
- return tag;
- }
-
- while (1) {
- spin_lock(&pool->lock);
+ int ret;

- /*
- * prepare_to_wait() must come before steal_tags(), in case
- * percpu_ida_free() on another cpu flips a bit in
- * cpus_have_tags
- *
- * global lock held and irqs disabled, don't need percpu lock
- */
- prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
+ ret = alloc_local_tag(pool);
+ if (likely(ret >= 0))
+ return ret;

- if (!tags->nr_free)
- alloc_global_tags(pool, tags);
- if (!tags->nr_free)
- steal_tags(pool, tags);
+ if (state != TASK_RUNNING) {
+ int tag;

- if (tags->nr_free) {
- tag = tags->freelist[--tags->nr_free];
- if (tags->nr_free)
- cpumask_set_cpu(smp_processor_id(),
- &pool->cpus_have_tags);
- }
+ ret = ___wait_event(pool->wait,
+ (tag = alloc_global_tag(pool)) >= 0,
+ state, 0, 0, schedule());

- spin_unlock(&pool->lock);
- local_irq_restore(flags);
-
- if (tag >= 0 || !(gfp & __GFP_WAIT))
- break;
-
- schedule();
-
- local_irq_save(flags);
- tags = this_cpu_ptr(pool->tag_cpu);
+ if (tag >= 0)
+ ret = tag;
+ } else {
+ ret = alloc_global_tag(pool);
}

- finish_wait(&pool->wait, &wait);
- return tag;
+ return ret;
}
EXPORT_SYMBOL_GPL(percpu_ida_alloc);

@@ -235,12 +230,19 @@ void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
wake_up(&pool->wait);
}

+ /*
+ * No point in waking when 1 < n < max_size free, because steal_tags()
+ * will assume max_size per set bit, having more free will not make it
+ * more or less likely it will visit this cpu.
+ */
+
if (nr_free == pool->percpu_max_size) {
spin_lock(&pool->lock);

/*
- * Global lock held and irqs disabled, don't need percpu
- * lock
+ * Global lock held and irqs disabled, don't need percpu lock
+ * because everybody accessing remote @tags will hold
+ * pool->lock -- steal_tags().
*/
if (tags->nr_free == pool->percpu_max_size) {
move_tags(pool->freelist, &pool->nr_free,

Peter Zijlstra

unread,
Jan 23, 2014, 10:50:02 AM1/23/14
to
On Thu, Jan 23, 2014 at 05:55:39AM -0800, Kent Overstreet wrote:
> On Thu, Jan 23, 2014 at 02:50:03PM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 23, 2014 at 05:28:29AM -0800, Kent Overstreet wrote:
> > > pool->lock is also going to be fairly badly contended in the worst case,
> > > and that can get real bad real fast... now that I think about it we
> > > probably want to avoid the __alloc_global_tag() double call just because
> > > of that, pool->lock is going to be quite a bit more contended than the
> > > waitlist lock just because fo the amount of work done under it.
> >
> > OK, how about this then.. Not quite at pretty though
>
> Heh, I suppose that is a solution. Let me screw around to see what I can
> come up with tomorrow, but if I can't come up with anything I like I'm
> not opposed to this.

Please also consider the below patch.

---
Subject: percpu-ida: Reduce IRQ held duration

It's bad manners to hold IRQs disabled over a full cpumask iteration.
Change it so that it disables the IRQs only where strictly required to
avoid lock inversion issues.

Signed-off-by: Peter Zijlstra <pet...@infradead.org>
---
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -342,33 +342,31 @@ EXPORT_SYMBOL_GPL(__percpu_ida_init);
int percpu_ida_for_each_free(struct percpu_ida *pool, percpu_ida_cb fn,
void *data)
{
- unsigned long flags;
struct percpu_ida_cpu *remote;
- unsigned cpu, i, err = 0;
+ unsigned long flags;
+ int cpu, i, err = 0;

- local_irq_save(flags);
for_each_possible_cpu(cpu) {
remote = per_cpu_ptr(pool->tag_cpu, cpu);
- spin_lock(&remote->lock);
+ spin_lock_irqsave(&remote->lock, flags);
for (i = 0; i < remote->nr_free; i++) {
err = fn(remote->freelist[i], data);
if (err)
break;
}
- spin_unlock(&remote->lock);
+ spin_unlock_irqrestore(&remote->lock, flags);
if (err)
- goto out;
+ return err;
}

- spin_lock(&pool->lock);
+ spin_lock_irqsave(&pool->lock, flags);
for (i = 0; i < pool->nr_free; i++) {
err = fn(pool->freelist[i], data);
if (err)
break;
}
- spin_unlock(&pool->lock);
-out:
- local_irq_restore(flags);
+ spin_unlock_irqrestore(&pool->lock, flags);
+
return err;
}
EXPORT_SYMBOL_GPL(percpu_ida_for_each_free);

Peter Zijlstra

unread,
Jan 23, 2014, 11:30:02 AM1/23/14
to
On Thu, Jan 23, 2014 at 05:28:29AM -0800, Kent Overstreet wrote:
> pool->lock is also going to be fairly badly contended in the worst case,
> and that can get real bad real fast... now that I think about it we
> probably want to avoid the __alloc_global_tag() double call just because
> of that, pool->lock is going to be quite a bit more contended than the
> waitlist lock just because of the amount of work done under it.

On top of the two previous; I think we can reduce pool->lock contention
by not holding it while doing steal_tags().

By dropping pool->lock around steal_tags() we lose serialization over:

pool->cpus_have_tags is an atomic bitmask, and
pool->cpu_last_stolen, that's a heuristic anyway, so sod it.

We further lose the guarantee relied upon by percpu_ida_free(), so have
it also acquire the tags->lock, which should be a far less contended
resource.

Now everything modifying percpu_ida_cpu state holds
percpu_ida_cpu::lock, everything that modifies the actual percpu_ida
freelists holds percpu_ida::lock, and percpu_ida_cpu::lock nests inside
percpu_ida::lock.
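
Spelled out, whenever both locks are needed the order is (sketch of the
free slow path below, not an extra hunk):

	/* lock order: percpu_ida::lock (outer), percpu_ida_cpu::lock (inner) */
	spin_lock(&pool->lock);
	spin_lock(&tags->lock);

	if (tags->nr_free == pool->percpu_max_size) {
		move_tags(pool->freelist, &pool->nr_free,
			  tags->freelist, &tags->nr_free,
			  pool->percpu_max_size);
		wake_up(&pool->wait);
	}

	spin_unlock(&tags->lock);
	spin_unlock(&pool->lock);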


The only annoying thing is that we're still holding IRQs over
steal_tags(), we should be able to make that a preempt_disable() without
too much effort, or very much cheat and drop even that and rely on the
percpu_ida_cpu::lock to serialize everything and just hope that we don't
migrate too often.

But that's for another patch.

---
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -68,8 +68,6 @@ static inline void steal_tags(struct per
unsigned cpus_have_tags, cpu = pool->cpu_last_stolen;
struct percpu_ida_cpu *remote;

- lockdep_assert_held(&pool->lock);
-
for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
cpus_have_tags * pool->percpu_max_size > pool->nr_tags / 2;
cpus_have_tags--) {
@@ -141,18 +139,24 @@ static inline int alloc_global_tag(struc
min(pool->nr_free, pool->percpu_batch_size));
}

+ spin_unlock(&pool->lock);
+
if (!tags->nr_free)
steal_tags(pool, tags);

if (tags->nr_free) {
- tag = tags->freelist[--tags->nr_free];
+ spin_lock(&tags->lock);
if (tags->nr_free) {
- cpumask_set_cpu(smp_processor_id(),
- &pool->cpus_have_tags);
+ tag = tags->freelist[--tags->nr_free];
+ if (tags->nr_free) {
+ cpumask_set_cpu(smp_processor_id(),
+ &pool->cpus_have_tags);
+ }
}
+ spin_unlock(&tags->lock);
}

- spin_unlock_irqrestore(&pool->lock, flags);
+ local_irq_restore(flags);

return tag;
}
@@ -238,12 +242,8 @@ void percpu_ida_free(struct percpu_ida *

if (nr_free == pool->percpu_max_size) {
spin_lock(&pool->lock);
+ spin_lock(&tags->lock);

- /*
- * Global lock held and irqs disabled, don't need percpu lock
- * because everybody accessing remote @tags will hold
- * pool->lock -- steal_tags().
- */
if (tags->nr_free == pool->percpu_max_size) {
move_tags(pool->freelist, &pool->nr_free,
tags->freelist, &tags->nr_free,
@@ -251,6 +251,8 @@ void percpu_ida_free(struct percpu_ida *

wake_up(&pool->wait);
}
+
+ spin_unlock(&tags->lock);
spin_unlock(&pool->lock);

Peter Zijlstra

unread,
Jan 23, 2014, 11:50:02 AM1/23/14
to
On Thu, Jan 23, 2014 at 05:22:54PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 05:28:29AM -0800, Kent Overstreet wrote:
> > pool->lock is also going to be fairly badly contended in the worst case,
> > and that can get real bad real fast... now that I think about it we
> > probably want to avoid the __alloc_global_tag() double call just because
> > of that, pool->lock is going to be quite a bit more contended than the
> > waitlist lock just because of the amount of work done under it.

> Now everything modifying percpu_ida_cpu state holds
> percpu_ida_cpu::lock

Almost that.

---
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -51,6 +51,15 @@ static inline void move_tags(unsigned *d
*dst_nr += nr;
}

+static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+ if (l1 > l2)
+ swap(l1, l2);
+
+ spin_lock(l1);
+ spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
/*
* Try to steal tags from a remote cpu's percpu freelist.
*
@@ -87,7 +96,7 @@ static inline void steal_tags(struct per
if (remote == tags)
continue;

- spin_lock(&remote->lock);
+ double_lock(&tags->lock, &remote->lock);

if (remote->nr_free) {
memcpy(tags->freelist,
@@ -99,6 +108,7 @@ static inline void steal_tags(struct per
}

spin_unlock(&remote->lock);
+ spin_unlock(&tags->lock);

if (tags->nr_free)
break;

Nicholas A. Bellinger

unread,
Jan 23, 2014, 1:40:01 PM1/23/14
to
On Wed, 2014-01-22 at 11:53 -0800, Nicholas A. Bellinger wrote:
> Hi Peter,
>
> Does this satisfy your questions..?
>
> Do you have any more concerns about TASK_RUNNING + prepare_to_wait()
> usage in percpu_ida_alloc() that need to be addressed before I can drop
> this series into target-pending/for-next to address the original bug..?
>

Given the silence, I'll assume you're OK with the initial TASK_RUNNING +
prepare_to_wait() bit, right..?

--nab

Peter Zijlstra

unread,
Jan 23, 2014, 2:20:03 PM1/23/14
to
On Thu, Jan 23, 2014 at 10:40:04AM -0800, Nicholas A. Bellinger wrote:
> On Wed, 2014-01-22 at 11:53 -0800, Nicholas A. Bellinger wrote:
> > Hi Peter,
> >
> > Does this satisfy your questions..?
> >
> > Do you have any more concerns about TASK_RUNNING + prepare_to_wait()
> > usage in percpu_ida_alloc() that need to be addressed before I can drop
> > this series into target-pending/for-next to address the original bug..?
> >
>
> Given the silence,

You mean the silence in which I sent 4+ emails earlier today?

> I'll assume you're OK with the initial TASK_RUNNING +
> prepare_to_wait() bit, right..?

No, I would prefer not to do that. While it does work, it's awkward at
best.

Nicholas A. Bellinger

unread,
Jan 23, 2014, 2:30:02 PM1/23/14
to
On Thu, 2014-01-23 at 20:12 +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 10:40:04AM -0800, Nicholas A. Bellinger wrote:
> > On Wed, 2014-01-22 at 11:53 -0800, Nicholas A. Bellinger wrote:
> > > Hi Peter,
> > >
> > > Does this satisfy your questions..?
> > >
> > > Do you have any more concerns about TASK_RUNNING + prepare_to_wait()
> > > usage in percpu_ida_alloc() that need to be addressed before I can drop
> > > this series into target-pending/for-next to address the original bug..?
> > >
> >
> > Given the silence,
>
> You mean the silence in which I sent 4+ emails earlier today?
>
> > I'll assume you're OK with the initial TASK_RUNNING +
> > prepare_to_wait() bit, right..?
>
> No, I would prefer not to do that. While it does work, it's awkward at
> best.

AFAICT, those changes don't address the original bug that the series was
trying to address, allowing the percpu_ida_alloc() tag stealing slow
path to be interrupted by a signal..

Also, keep in mind this change needs to be backported to >= v3.12, which
is why the percpu_ida changes have been kept to a minimum.

--nab

Kent Overstreet

unread,
Jan 23, 2014, 2:40:01 PM1/23/14
to
On Thu, Jan 23, 2014 at 08:12:29PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 10:40:04AM -0800, Nicholas A. Bellinger wrote:
> > On Wed, 2014-01-22 at 11:53 -0800, Nicholas A. Bellinger wrote:
> > > Hi Peter,
> > >
> > > Does this satisfy your questions..?
> > >
> > > Do you have any more concerns about TASK_RUNNING + prepare_to_wait()
> > > usage in percpu_ida_alloc() that need to be addressed before I can drop
> > > this series into target-pending/for-next to address the original bug..?
> > >
> >
> > Given the silence,
>
> You mean the silence in which I sent 4+ emails earlier today?
>
> > I'll assume you're OK with the initial TASK_RUNNING +
> > prepare_to_wait() bit, right..?
>
> No, I would prefer not to do that. While it does work, it's awkward at
> best.

I do like the improvements, but personally I really don't see anything wrong
with the initial patch, and for backporting that really is what we should do -
this code _is_ subtle enough that backporting our (your) improvements is not
something I'd want to do, having debugged this code when I first wrote it...

Kent Overstreet

unread,
Jan 23, 2014, 2:40:02 PM1/23/14
to
On Thu, Jan 23, 2014 at 05:22:54PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 05:28:29AM -0800, Kent Overstreet wrote:
> > pool->lock is also going to be fairly badly contended in the worst case,
> > and that can get real bad real fast... now that I think about it we
> > probably want to avoid the __alloc_global_tag() double call just because
> > of that, pool->lock is going to be quite a bit more contended than the
> > waitlist lock just because of the amount of work done under it.
>
> On top of the two previous; I think we can reduce pool->lock contention
> by not holding it while doing steal_tags().
>
> By dropping pool->lock around steal_tags() we lose serialization over:
>
> pool->cpus_have_tags is an atomic bitmask, and
> pool->cpu_last_stolen, that's a heuristic anyway, so sod it.
>
> We further lose the guarantee relied upon by percpu_ida_free(), so have
> it also acquire the tags->lock, which should be a far less contended
> resource.
>
> Now everything modifying percpu_ida_cpu state holds
> percpu_ida_cpu::lock, everything that modifies the actual percpu_ida
> freelists holds percpu_ida::lock, and percpu_ida_cpu::lock nests inside
> percpu_ida::lock.

I find myself wondering why I didn't originally do this - I'm a bit worried
there was some subtle race I've forgotten about with this approach - but
probably I just got tired of trying out different subtle optimizations and
trying to reason about them, is all :)

So yeah, assuming we can't find anything wrong with this approach - this should
be great.

> The only annoying thing is that we're still holding IRQs over
> steal_tags(), we should be able to make that a preempt_disable() without
> too much effort, or very much cheat and drop even that and rely on the
> percpu_ida_cpu::lock to serialize everything and just hope that we don't
> migrate too often.

I don't think that's an issue - it's pretty hard to come up with a scenario
where most/all of a large number of cpus are marked in the bit vector as having
tags, and _then_ none of them actually have tags (because as soon as one of them
does, we'll succeed and break out of the loop).

And then that would only be able to happen once, because we clear bits as we go.

And we need interrupts disabled while we're holding any of the spinlocks (as
freeing can certainly happen in atomic context), so the only alternative would
be to save/restore interrupts on every loop iteration, and that'll get _real_
expensive fast. (last I profiled it nested push/pop of the flags register was
_ridiculous_, sigh)
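
To be explicit about what I mean, the per-iteration variant would look
roughly like this - sketch of the loop body only, using your double_lock()
helper and leaving out the cpu_last_stolen rotation:

	for_each_cpu(cpu, &pool->cpus_have_tags) {
		struct percpu_ida_cpu *remote = per_cpu_ptr(pool->tag_cpu, cpu);

		if (remote == tags)
			continue;

		cpumask_clear_cpu(cpu, &pool->cpus_have_tags);

		/* flags pushed and popped once per remote cpu (the expensive bit) */
		local_irq_save(flags);
		double_lock(&tags->lock, &remote->lock);

		if (remote->nr_free) {
			memcpy(tags->freelist, remote->freelist,
			       sizeof(unsigned) * remote->nr_free);
			tags->nr_free = remote->nr_free;
			remote->nr_free = 0;
		}

		spin_unlock(&remote->lock);
		spin_unlock(&tags->lock);
		local_irq_restore(flags);

		if (tags->nr_free)
			break;
	}

Which is exactly the flags push/pop per iteration I'd rather not pay for.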

Nicholas A. Bellinger

unread,
Jan 23, 2014, 2:40:02 PM1/23/14
to
On Thu, 2014-01-23 at 11:31 -0800, Nicholas A. Bellinger wrote:
> On Thu, 2014-01-23 at 20:12 +0100, Peter Zijlstra wrote:
> > On Thu, Jan 23, 2014 at 10:40:04AM -0800, Nicholas A. Bellinger wrote:
> > > On Wed, 2014-01-22 at 11:53 -0800, Nicholas A. Bellinger wrote:
> > > > Hi Peter,
> > > >
> > > > Does this satisfy your questions..?
> > > >
> > > > Do you have any more concerns about TASK_RUNNING + prepare_to_wait()
> > > > usage in percpu_ida_alloc() that need to be addressed before I can drop
> > > > this series into target-pending/for-next to address the original bug..?
> > > >
> > >
> > > Given the silence,
> >
> > You mean the silence in which I sent 4+ emails earlier today?
> >
> > > I'll assume you're OK with the initial TASK_RUNNING +
> > > prepare_to_wait() bit, right..?
> >
> > No, I would prefer not to do that. While it does work, it's awkward at
> > best.
>
> AFAICT, those changes don't address the original bug that the series was
> trying to address, allowing the percpu_ida_alloc() tag stealing slow
> path to be interrupted by a signal..
>
> Also, keep in mind this change needs to be backported to >= v3.12, which
> is why the percpu_ida changes have been kept to a minimum.
>

<A little too eager to SEND>

So would you prefer the following addition to the original bugfix
instead..?

diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index a48ce2e..58b6714 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -174,7 +174,8 @@ int percpu_ida_alloc(struct percpu_ida *pool, int state)
*
* global lock held and irqs disabled, don't need percpu lock
*/
- prepare_to_wait(&pool->wait, &wait, state);
+ if (state != TASK_RUNNING)
+ prepare_to_wait(&pool->wait, &wait, state);

if (!tags->nr_free)
alloc_global_tags(pool, tags);
@@ -199,8 +200,9 @@ int percpu_ida_alloc(struct percpu_ida *pool, int state)
local_irq_save(flags);
tags = this_cpu_ptr(pool->tag_cpu);
}
+ if (state != TASK_RUNNING)
+ finish_wait(&pool->wait, &wait);

- finish_wait(&pool->wait, &wait);
return tag;
}
EXPORT_SYMBOL_GPL(percpu_ida_alloc);

Peter Zijlstra

unread,
Jan 24, 2014, 10:20:02 AM1/24/14
to
On Thu, Jan 23, 2014 at 11:38:24AM -0800, Nicholas A. Bellinger wrote:
> > AFAICT, those changes don't address the original bug that the series was
> > trying to address, allowing the percpu_ida_alloc() tag stealing slow
> > path to be interrupted by a signal..
> >
> > Also, keep in mind this change needs to be backported to >= v3.12, which
> > is why the percpu_ida changes have been kept to a minimum.

Well, the other option is to revert whatever caused the issue in the
first place :-)

I'm not much for making ugly fixes just because it's easier to backport.

> <A little too eager to SEND>
>
> So would you prefer the following addition to the original bugfix
> instead..?

It'll make a right old mess out of percpu_ida.c, but yeah.

Nicholas A. Bellinger

unread,
Jan 25, 2014, 1:40:01 AM1/25/14
to
On Fri, 2014-01-24 at 16:14 +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 11:38:24AM -0800, Nicholas A. Bellinger wrote:
> > > AFAICT, those changes don't address the original bug that the series was
> > > trying to address, allowing the percpu_ida_alloc() tag stealing slow
> > > path to be interrupted by a signal..
> > >
> > > Also, keep in mind this change needs to be backported to >= v3.12, which
> > > is why the percpu_ida changes have been kept to a minimum.
>
> Well, the other option is to revert whatever caused the issue in the
> first place :-)
>
> I'm not much for making ugly fixes just because it's easier to backport.
>

Me neither, but given Kent's comment about the subtle issues with larger
improvements to the tag stealing slow path, it's the right approach for
addressing a bug in stable code.

Also, I'd prefer to avoid introducing new issues for blk-mq in the first
round of stable code, given the amount of testing it's already undergone
with percpu_ida for v3.13.

> > <A little too eager to SEND>
> >
> > So would you prefer the following addition to the original bugfix
> > instead..?
>
> It'll make a right old mess out of percpu_ida.c, but yeah.
>

Kent and I would both like to see your improvements merged upstream, just
not in a manner that would cause Greg-KH to start cursing loudly over a
stable backport.

That said, unless Jens throws a last minute NACK, I'll be queuing patches
#1 + #3 into target-pending/for-next for Sunday night's build.

Jens, please feel free to pick up Patch #2 for a post v3.15 merge at your
earliest convenience.

Thanks,

--nab

> > diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
> > index a48ce2e..58b6714 100644
> > --- a/lib/percpu_ida.c
> > +++ b/lib/percpu_ida.c
> > @@ -174,7 +174,8 @@ int percpu_ida_alloc(struct percpu_ida *pool, int state)
> > *
> > * global lock held and irqs disabled, don't need percpu lock
> > */
> > - prepare_to_wait(&pool->wait, &wait, state);
> > + if (state != TASK_RUNNING)
> > + prepare_to_wait(&pool->wait, &wait, state);
> >
> > if (!tags->nr_free)
> > alloc_global_tags(pool, tags);
> > @@ -199,8 +200,9 @@ int percpu_ida_alloc(struct percpu_ida *pool, int state)
> > local_irq_save(flags);
> > tags = this_cpu_ptr(pool->tag_cpu);
> > }
> > + if (state != TASK_RUNNING)
> > + finish_wait(&pool->wait, &wait);
> >
> > - finish_wait(&pool->wait, &wait);
> > return tag;
> > }
> > EXPORT_SYMBOL_GPL(percpu_ida_alloc);
> >
> >

Alexander Gordeev

unread,
Feb 10, 2014, 4:30:03 AM2/10/14
to
On Thu, Jan 23, 2014 at 05:22:54PM +0100, Peter Zijlstra wrote:
Two concurrent threads find and unset the very same bit a few lines below

cpu = cpumask_next(cpu, &pool->cpus_have_tags);

[...]

cpumask_clear_cpu(cpu, &pool->cpus_have_tags);

The second thread's unset races with cpumask_set_cpu() in percpu_ida_free()
or alloc_global_tag()

if (nr_free == 1) {
cpumask_set_cpu(smp_processor_id(),
&pool->cpus_have_tags);
wake_up(&pool->wait);
}

Which results in the woken thread not finding the (illegitimately) unset
bit while looping over the mask in steal_tags(). I suspect this or another
thread might enter an unwakeable sleep as the number of bits in the mask
and the number of threads in the waitqueue become unbalanced.
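
To illustrate the interleaving I have in mind (hypothetical trace, cpu
numbers made up):

  1) thread A in steal_tags():   cpu = cpumask_next() returns X (bit X set)
  2) thread B in steal_tags():   cpu = cpumask_next() returns X (bit X still set)
  3) thread A:                   cpumask_clear_cpu(X), steals cpu X's tags
  4) cpu X in percpu_ida_free(): nr_free hits 1, cpumask_set_cpu(X), wake_up()
  5) thread B:                   cpumask_clear_cpu(X), wiping the bit set in 4)
  6) the woken waiter loops over cpus_have_tags, never sees X, finds nothing
     and goes back to sleep even though cpu X has a free tag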

Hope I am wrong here.
--
Regards,
Alexander Gordeev
agor...@redhat.com