[PATCHSET] workqueue: concurrency managed workqueue, take#4

Tejun Heo

unread,

Feb 26, 2010, 7:30:05 AM2/26/10

to

Hello, all.

This is the fourth take of cmwq (concurrency managed workqueue)
patchset. It's on top of 60b341b778cc2929df16c0a504c91621b3c6a4ad
(v2.6.33). Git tree is available at

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-cmwq

Quilt series is available at

http://master.kernel.org/~tj/patches/review-cmwq.tar.gz

I tested the fscache changes with nfs + cachefiles and it works well
for me but the workload wasn't enough to put the yielding logic to the
test so it needs to be verified.

Please note that the scheduler patches need description update. I'll
do that when establishing scheduler merge tree.

Depending on how you look at the result, the perf test from the last
take[L] showed no performance regression or insignificant improvement.
I'm quite happy with libata conversion. Async works well with the
backend replaced with cmwq. Fscache conversion is still in progress
but fscache workers are mostly used to issue and wait for IOs and I
think conversion so far shows that with some more impedence matching,
there shouldn't be major issues.

Now with non-reentrant and debugfs support, the whole series add about
800 lines but a lot are for cold-path things like CPU hotplug,
freezing and debugging. Given that further conversions are likely to
simplify other workqueue users and the added capability, I don't think
800 more lines at this point is much.

Unless there still are major objections, I'd really like to go forward
with setting up a stable devel tree. Ingo, do you still have
reservations about setting up a scheduler devel branch for cmwq?

The following patches have been added/updated since the last take[L].

0008-workqueue-change-cancel_work_sync-to-clear-work-data.patch
0013-workqueue-define-masks-for-work-flags-and-conditiona.patch
0028-workqueue-carry-cpu-number-in-work-data-once-executi.patch
0029-workqueue-implement-WQ_NON_REENTRANT.patch
0033-workqueue-add-system_wq-system_long_wq-and-system_nr.patch
0034-workqueue-implement-DEBUGFS-workqueue.patch
0035-workqueue-implement-several-utility-APIs.patch
0038-fscache-convert-object-to-use-workqueue-instead-of-s.patch
0039-fscache-convert-operation-to-use-workqueue-instead-o.patch

* Oleg's 0008-workqueue-change-cancel_work_sync-to-clear-work-data
added. It clears work->data after cancel_work_sync(). cmwq patches
updated accordingly.

* 0013 updated such that WORK_STRUCT_STATIC bit is used iff
CONFIG_DEBUG_OBJECTS_WORK is enabled. This reduces cwq alignment to
64bytes with debug objects disabled.

* 0028-0029 added to implement non-reentrant workqueue. A workqueue
can be made non-reentrant by specifying WQ_NON_REENTRANT on
creation.

When a work starts executing, the data part of work->data is set to
the CPU number so that NRT workqueue can reliably determine where
the work was last on on the next queue. Once the last CPU is known,
the queueing code looks up the busy worker hash and determines
whether the work is still running there in which case the work is
queued on that cpu. As workqueue guarantees non-reentrance on
single CPU, this extra affining makes it globally non-reentrant.

Delayed queueing path is updated to preserve the CPU number recorded
in wq->data and flush and cancel code paths are updated to first
look up the gcwq for a work rather than cwq which no longer is
available once a work starts executing.

* In 0033, system_single_workqueue replaced with system_nrt_workqueue.

* 0034 adds debugfs support. If CONFIG_WORKQUEUE_DEBUGFS is enabled,
<debugfs>/workqueue lists all workers and works. The output is
pretty similar to that of slow-work debugfs and also has per-wq
custom show method mechanism copied from slow-work.

* 0035 is what used to be 0030-workqueue-implement-work_busy.
work_busy() is extended to check both pending and running states and
other utility functions are added too - workqueue_set_max_active(),
workqueue_congested() and work_cpu().

* fscache conversion patches 0038-0039 updated so that
- non-reentrant workqueues are used instead of single workqueues.
- sysctl knobs added to control max_active.
- object worker yielding mechanism is implemented in fscache proper
using workqueue_congested().
- debug information remains equivalent.

* Other misc tweaks.

This patchset contains the following patches.

0001-sched-consult-online-mask-instead-of-active-in-selec.patch
0002-sched-rename-preempt_notifiers-to-sched_notifiers-an.patch
0003-sched-refactor-try_to_wake_up.patch
0004-sched-implement-__set_cpus_allowed.patch
0005-sched-make-sched_notifiers-unconditional.patch
0006-sched-add-wakeup-sleep-sched_notifiers-and-allow-NUL.patch
0007-sched-implement-try_to_wake_up_local.patch
0008-workqueue-change-cancel_work_sync-to-clear-work-data.patch
0009-acpi-use-queue_work_on-instead-of-binding-workqueue-.patch
0010-stop_machine-reimplement-without-using-workqueue.patch
0011-workqueue-misc-cosmetic-updates.patch
0012-workqueue-merge-feature-parameters-into-flags.patch
0013-workqueue-define-masks-for-work-flags-and-conditiona.patch
0014-workqueue-separate-out-process_one_work.patch
0015-workqueue-temporarily-disable-workqueue-tracing.patch
0016-workqueue-kill-cpu_populated_map.patch
0017-workqueue-update-cwq-alignement.patch
0018-workqueue-reimplement-workqueue-flushing-using-color.patch
0019-workqueue-introduce-worker.patch
0020-workqueue-reimplement-work-flushing-using-linked-wor.patch
0021-workqueue-implement-per-cwq-active-work-limit.patch
0022-workqueue-reimplement-workqueue-freeze-using-max_act.patch
0023-workqueue-introduce-global-cwq-and-unify-cwq-locks.patch
0024-workqueue-implement-worker-states.patch
0025-workqueue-reimplement-CPU-hotplugging-support-using-.patch
0026-workqueue-make-single-thread-workqueue-shared-worker.patch
0027-workqueue-add-find_worker_executing_work-and-track-c.patch
0028-workqueue-carry-cpu-number-in-work-data-once-executi.patch
0029-workqueue-implement-WQ_NON_REENTRANT.patch
0030-workqueue-use-shared-worklist-and-pool-all-workers-p.patch
0031-workqueue-implement-concurrency-managed-dynamic-work.patch
0032-workqueue-increase-max_active-of-keventd-and-kill-cu.patch
0033-workqueue-add-system_wq-system_long_wq-and-system_nr.patch
0034-workqueue-implement-DEBUGFS-workqueue.patch
0035-workqueue-implement-several-utility-APIs.patch
0036-libata-take-advantage-of-cmwq-and-remove-concurrency.patch
0037-async-use-workqueue-for-worker-pool.patch
0038-fscache-convert-object-to-use-workqueue-instead-of-s.patch
0039-fscache-convert-operation-to-use-workqueue-instead-o.patch
0040-fscache-drop-references-to-slow-work.patch
0041-cifs-use-workqueue-instead-of-slow-work.patch
0042-gfs2-use-workqueue-instead-of-slow-work.patch
0043-slow-work-kill-it.patch

diffstat follows.

Thanks.

--
tejun

[L] http://thread.gmane.org/gmane.linux.kernel/939353
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Tejun Heo

unread,

Feb 26, 2010, 7:30:05 AM2/26/10

to

Work flags are about to see more traditional mask handling. Define
WORK_STRUCT_*_BIT as the bit position constant and redefine
WORK_STRUCT_* as bit masks. Also, make WORK_STRUCT_STATIC_* flags
conditional

While at it, re-define these constants as enums and use
WORK_STRUCT_STATIC instead of hard-coding 2 in
WORK_DATA_STATIC_INIT().

Signed-off-by: Tejun Heo <t...@kernel.org>
---
include/linux/workqueue.h | 29 +++++++++++++++++++++--------
kernel/workqueue.c | 12 ++++++------
2 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index d89cfc1..d60c570 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -22,12 +22,25 @@ typedef void (*work_func_t)(struct work_struct *work);
*/
#define work_data_bits(work) ((unsigned long *)(&(work)->data))

+enum {
+ WORK_STRUCT_PENDING_BIT = 0, /* work item is pending execution */
+#ifdef CONFIG_DEBUG_OBJECTS_WORK
+ WORK_STRUCT_STATIC_BIT = 1, /* static initializer (debugobjects) */
+#endif
+
+ WORK_STRUCT_PENDING = 1 << WORK_STRUCT_PENDING_BIT,
+#ifdef CONFIG_DEBUG_OBJECTS_WORK
+ WORK_STRUCT_STATIC = 1 << WORK_STRUCT_STATIC_BIT,
+#else
+ WORK_STRUCT_STATIC = 0,
+#endif
+
+ WORK_STRUCT_FLAG_MASK = 3UL,
+ WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
+};
+
struct work_struct {
atomic_long_t data;
-#define WORK_STRUCT_PENDING 0 /* T if work item pending execution */
-#define WORK_STRUCT_STATIC 1 /* static initializer (debugobjects) */
-#define WORK_STRUCT_FLAG_MASK (3UL)
-#define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
struct list_head entry;
work_func_t func;
#ifdef CONFIG_LOCKDEP
@@ -36,7 +49,7 @@ struct work_struct {
};

#define WORK_DATA_INIT() ATOMIC_LONG_INIT(0)
-#define WORK_DATA_STATIC_INIT() ATOMIC_LONG_INIT(2)
+#define WORK_DATA_STATIC_INIT() ATOMIC_LONG_INIT(WORK_STRUCT_STATIC)

struct delayed_work {
struct work_struct work;
@@ -98,7 +111,7 @@ extern void __init_work(struct work_struct *work, int onstack);
extern void destroy_work_on_stack(struct work_struct *work);
static inline unsigned int work_static(struct work_struct *work)
{
- return *work_data_bits(work) & (1 << WORK_STRUCT_STATIC);
+ return *work_data_bits(work) & WORK_STRUCT_STATIC;
}
#else
static inline void __init_work(struct work_struct *work, int onstack) { }
@@ -167,7 +180,7 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
* @work: The work item in question
*/
#define work_pending(work) \
- test_bit(WORK_STRUCT_PENDING, work_data_bits(work))
+ test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))

/**
* delayed_work_pending - Find out whether a delayable work item is currently
@@ -182,7 +195,7 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
* @work: The work item in question
*/
#define work_clear_pending(work) \
- clear_bit(WORK_STRUCT_PENDING, work_data_bits(work))
+ clear_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))

enum {
WQ_FREEZEABLE = 1 << 0, /* freeze during suspend */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 79fd183..c73d5e3 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -115,7 +115,7 @@ static int work_fixup_activate(void *addr, enum debug_obj_state state)
* statically initialized. We just make sure that it
* is tracked in the object tracker.
*/
- if (test_bit(WORK_STRUCT_STATIC, work_data_bits(work))) {
+ if (test_bit(WORK_STRUCT_STATIC_BIT, work_data_bits(work))) {
debug_object_init(work, &work_debug_descr);
debug_object_activate(work, &work_debug_descr);
return 0;
@@ -232,7 +232,7 @@ static inline void set_wq_data(struct work_struct *work,
BUG_ON(!work_pending(work));

atomic_long_set(&work->data, (unsigned long)cwq | work_static(work) |
- (1UL << WORK_STRUCT_PENDING) | extra_flags);
+ WORK_STRUCT_PENDING | extra_flags);
}

/*
@@ -330,7 +330,7 @@ queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
{
int ret = 0;

- if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
+ if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
__queue_work(cpu, wq, work);
ret = 1;
}
@@ -380,7 +380,7 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
struct timer_list *timer = &dwork->timer;
struct work_struct *work = &dwork->work;

- if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
+ if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
BUG_ON(timer_pending(timer));
BUG_ON(!list_empty(&work->entry));

@@ -516,7 +516,7 @@ static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
* might deadlock.
*/
INIT_WORK_ON_STACK(&barr->work, wq_barrier_func);
- __set_bit(WORK_STRUCT_PENDING, work_data_bits(&barr->work));
+ __set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&barr->work));
init_completion(&barr->done);

debug_work_activate(&barr->work);
@@ -628,7 +628,7 @@ static int try_to_grab_pending(struct work_struct *work)
struct cpu_workqueue_struct *cwq;
int ret = -1;

- if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work)))
+ if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
return 0;

/*
--
1.6.4.2

Tejun Heo

unread,

Feb 26, 2010, 7:30:04 AM2/26/10

to

Now that all the workers are tracked by gcwq, we can find which worker
is executing a work from gcwq. Implement find_worker_executing_work()
and make worker track its current_cwq so that we can find things the
other way around. This will be used to implement non-reentrant wqs.

Signed-off-by: Tejun Heo <t...@kernel.org>
---

kernel/workqueue.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9993055..d1a7aaf 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -82,6 +82,7 @@ struct worker {
};

struct work_struct *current_work; /* L: work being processed */
+ struct cpu_workqueue_struct *current_cwq; /* L: current_work's cwq */
struct list_head scheduled; /* L: scheduled works */
struct task_struct *task; /* I: worker task */
struct global_cwq *gcwq; /* I: the associated gcwq */
@@ -373,6 +374,59 @@ static struct hlist_head *busy_worker_head(struct global_cwq *gcwq,
}

/**
+ * __find_worker_executing_work - find worker which is executing a work
+ * @gcwq: gcwq of interest
+ * @bwh: hash head as returned by busy_worker_head()
+ * @work: work to find worker for
+ *
+ * Find a worker which is executing @work on @gcwq. @bwh should be
+ * the hash head obtained by calling busy_worker_head() with the same
+ * work.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to worker which is executing @work if found, NULL
+ * otherwise.
+ */
+static struct worker *__find_worker_executing_work(struct global_cwq *gcwq,
+ struct hlist_head *bwh,
+ struct work_struct *work)
+{
+ struct worker *worker;
+ struct hlist_node *tmp;
+
+ hlist_for_each_entry(worker, tmp, bwh, hentry)
+ if (worker->current_work == work)
+ return worker;
+ return NULL;
+}
+
+/**
+ * find_worker_executing_work - find worker which is executing a work
+ * @gcwq: gcwq of interest
+ * @work: work to find worker for
+ *
+ * Find a worker which is executing @work on @gcwq. This function is
+ * identical to __find_worker_executing_work() except that this
+ * function calculates @bwh itself.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to worker which is executing @work if found, NULL
+ * otherwise.
+ */
+static struct worker *find_worker_executing_work(struct global_cwq *gcwq,
+ struct work_struct *work)
+{
+ return __find_worker_executing_work(gcwq, busy_worker_head(gcwq, work),
+ work);
+}
+
+/**
* insert_work - insert a work into cwq
* @cwq: cwq @work belongs to
* @work: work to insert
@@ -915,6 +969,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
debug_work_deactivate(work);
hlist_add_head(&worker->hentry, bwh);
worker->current_work = work;
+ worker->current_cwq = cwq;
work_color = work_flags_to_color(*work_data_bits(work));
list_del_init(&work->entry);

@@ -943,6 +998,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
/* we're done with it, release */
hlist_del_init(&worker->hentry);
worker->current_work = NULL;
+ worker->current_cwq = NULL;
cwq_dec_nr_in_flight(cwq, work_color);
}

--
1.6.4.2

Tejun Heo

unread,

Feb 26, 2010, 7:30:04 AM2/26/10

to

Implement worker states. After created, a worker is STARTED. While a
worker isn't processing a work, it's IDLE and chained on
gcwq->idle_list. While processing a work, a worker is BUSY and
chained on gcwq->busy_hash. Also, gcwq now counts the number of all
workers and idle ones.

worker_thread() is restructured to reflect state transitions.
cwq->more_work is removed and waking up a worker makes it check for
events. A worker is killed by setting DIE flag while it's IDLE and
waking it up.

This gives gcwq better visibility of what's going on and allows it to
find out whether a work is executing quickly which is necessary to
have multiple workers processing the same cwq.

Signed-off-by: Tejun Heo <t...@kernel.org>
---

kernel/workqueue.c | 207 ++++++++++++++++++++++++++++++++++++++++++---------
1 files changed, 170 insertions(+), 37 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0b00722..fe1f3a8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -35,6 +35,17 @@
#include <linux/lockdep.h>
#include <linux/idr.h>

+enum {
+ /* worker flags */
+ WORKER_STARTED = 1 << 0, /* started */
+ WORKER_DIE = 1 << 1, /* die die die */
+ WORKER_IDLE = 1 << 2, /* is idle */
+
+ BUSY_WORKER_HASH_ORDER = 6, /* 64 pointers */
+ BUSY_WORKER_HASH_SIZE = 1 << BUSY_WORKER_HASH_ORDER,
+ BUSY_WORKER_HASH_MASK = BUSY_WORKER_HASH_SIZE - 1,
+};
+
/*
* Structure fields follow one of the following exclusion rules.
*
@@ -51,11 +62,18 @@ struct global_cwq;
struct cpu_workqueue_struct;

struct worker {
+ /* on idle list while idle, on busy hash table while busy */
+ union {
+ struct list_head entry; /* L: while idle */
+ struct hlist_node hentry; /* L: while busy */
+ };
+

struct work_struct *current_work; /* L: work being processed */

struct list_head scheduled; /* L: scheduled works */
struct task_struct *task; /* I: worker task */
struct global_cwq *gcwq; /* I: the associated gcwq */

struct cpu_workqueue_struct *cwq; /* I: the associated cwq */
+ unsigned int flags; /* L: flags */
int id; /* I: worker id */
};

@@ -65,6 +83,15 @@ struct worker {
struct global_cwq {
spinlock_t lock; /* the gcwq lock */
unsigned int cpu; /* I: the associated cpu */
+
+ int nr_workers; /* L: total number of workers */
+ int nr_idle; /* L: currently idle ones */
+
+ /* workers are chained either in the idle_list or busy_hash */
+ struct list_head idle_list; /* L: list of idle workers */
+ struct hlist_head busy_hash[BUSY_WORKER_HASH_SIZE];
+ /* L: hash of busy workers */
+
struct ida worker_ida; /* L: for worker IDs */
} ____cacheline_aligned_in_smp;

@@ -77,7 +104,6 @@ struct global_cwq {
struct cpu_workqueue_struct {

struct global_cwq *gcwq; /* I: the associated gcwq */

struct list_head worklist;
- wait_queue_head_t more_work;
struct worker *worker;
struct workqueue_struct *wq; /* I: the owning workqueue */
int work_color; /* L: current color */
@@ -307,6 +333,33 @@ static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
}

/**
+ * busy_worker_head - return the busy hash head for a work

+ * @gcwq: gcwq of interest

+ * @work: work to be hashed
+ *
+ * Return hash head of @gcwq for @work.

+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:

+ * Pointer to the hash head.
+ */
+static struct hlist_head *busy_worker_head(struct global_cwq *gcwq,

+ struct work_struct *work)
+{

+ const int base_shift = ilog2(sizeof(struct work_struct));
+ unsigned long v = (unsigned long)work;
+
+ /* simple shift and fold hash, do we need something better? */
+ v >>= base_shift;
+ v += v >> BUSY_WORKER_HASH_ORDER;
+ v &= BUSY_WORKER_HASH_MASK;
+
+ return &gcwq->busy_hash[v];

+}
+
+/**
* insert_work - insert a work into cwq
* @cwq: cwq @work belongs to
* @work: work to insert

@@ -332,7 +385,7 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
smp_wmb();

list_add_tail(&work->entry, head);
- wake_up(&cwq->more_work);
+ wake_up_process(cwq->worker->task);
}

static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
@@ -470,13 +523,59 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
}
EXPORT_SYMBOL_GPL(queue_delayed_work_on);

+/**
+ * worker_enter_idle - enter idle state
+ * @worker: worker which is entering idle state
+ *
+ * @worker is entering idle state. Update stats and idle timer if
+ * necessary.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock).
+ */
+static void worker_enter_idle(struct worker *worker)
+{
+ struct global_cwq *gcwq = worker->gcwq;
+
+ BUG_ON(worker->flags & WORKER_IDLE);
+ BUG_ON(!list_empty(&worker->entry) &&
+ (worker->hentry.next || worker->hentry.pprev));
+
+ worker->flags |= WORKER_IDLE;
+ gcwq->nr_idle++;
+
+ /* idle_list is LIFO */
+ list_add(&worker->entry, &gcwq->idle_list);
+}
+
+/**
+ * worker_leave_idle - leave idle state
+ * @worker: worker which is leaving idle state
+ *
+ * @worker is leaving idle state. Update stats.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock).
+ */
+static void worker_leave_idle(struct worker *worker)
+{
+ struct global_cwq *gcwq = worker->gcwq;
+
+ BUG_ON(!(worker->flags & WORKER_IDLE));
+ worker->flags &= ~WORKER_IDLE;
+ gcwq->nr_idle--;
+ list_del_init(&worker->entry);
+}
+
static struct worker *alloc_worker(void)
{
struct worker *worker;

worker = kzalloc(sizeof(*worker), GFP_KERNEL);
- if (worker)
+ if (worker) {
+ INIT_LIST_HEAD(&worker->entry);
INIT_LIST_HEAD(&worker->scheduled);
+ }
return worker;
}

@@ -541,13 +640,16 @@ fail:
* start_worker - start a newly created worker
* @worker: worker to start
*
- * Start @worker.
+ * Make the gcwq aware of @worker and start it.
*
* CONTEXT:
* spin_lock_irq(gcwq->lock).
*/
static void start_worker(struct worker *worker)
{
+ worker->flags |= WORKER_STARTED;
+ worker->gcwq->nr_workers++;
+ worker_enter_idle(worker);
wake_up_process(worker->task);
}

@@ -555,7 +657,10 @@ static void start_worker(struct worker *worker)
* destroy_worker - destroy a workqueue worker
* @worker: worker to be destroyed
*
- * Destroy @worker.
+ * Destroy @worker and adjust @gcwq stats accordingly.

+ *
+ * CONTEXT:

+ * spin_lock_irq(gcwq->lock) which is released and regrabbed.
*/
static void destroy_worker(struct worker *worker)
{
@@ -566,12 +671,21 @@ static void destroy_worker(struct worker *worker)
BUG_ON(worker->current_work);
BUG_ON(!list_empty(&worker->scheduled));

+ if (worker->flags & WORKER_STARTED)
+ gcwq->nr_workers--;
+ if (worker->flags & WORKER_IDLE)
+ gcwq->nr_idle--;
+
+ list_del_init(&worker->entry);
+ worker->flags |= WORKER_DIE;
+
+ spin_unlock_irq(&gcwq->lock);
+
kthread_stop(worker->task);
kfree(worker);

spin_lock_irq(&gcwq->lock);
ida_remove(&gcwq->worker_ida, id);
- spin_unlock_irq(&gcwq->lock);
}

/**
@@ -687,6 +801,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
{
struct cpu_workqueue_struct *cwq = worker->cwq;
struct global_cwq *gcwq = cwq->gcwq;
+ struct hlist_head *bwh = busy_worker_head(gcwq, work);
work_func_t f = work->func;
int work_color;
#ifdef CONFIG_LOCKDEP
@@ -701,6 +816,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
#endif
/* claim and process */
debug_work_deactivate(work);
+ hlist_add_head(&worker->hentry, bwh);
worker->current_work = work;

work_color = work_flags_to_color(*work_data_bits(work));
list_del_init(&work->entry);

@@ -728,6 +844,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
spin_lock_irq(&gcwq->lock);

/* we're done with it, release */

+ hlist_del_init(&worker->hentry);
worker->current_work = NULL;
cwq_dec_nr_in_flight(cwq, work_color);
}
@@ -764,42 +881,52 @@ static int worker_thread(void *__worker)
struct worker *worker = __worker;
struct global_cwq *gcwq = worker->gcwq;
struct cpu_workqueue_struct *cwq = worker->cwq;
- DEFINE_WAIT(wait);

- for (;;) {
- prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
- if (!kthread_should_stop() &&
- list_empty(&cwq->worklist))
- schedule();
- finish_wait(&cwq->more_work, &wait);
+woke_up:
+ spin_lock_irq(&gcwq->lock);

- if (kthread_should_stop())
- break;
+ /* DIE can be set only while we're idle, checking here is enough */
+ if (worker->flags & WORKER_DIE) {
+ spin_unlock_irq(&gcwq->lock);
+ return 0;
+ }

- spin_lock_irq(&gcwq->lock);
+ worker_leave_idle(worker);

- while (!list_empty(&cwq->worklist)) {
- struct work_struct *work =
- list_first_entry(&cwq->worklist,
- struct work_struct, entry);
-
- if (likely(!(*work_data_bits(work) &
- WORK_STRUCT_LINKED))) {
- /* optimization path, not strictly necessary */
- process_one_work(worker, work);
- if (unlikely(!list_empty(&worker->scheduled)))
- process_scheduled_works(worker);
- } else {
- move_linked_works(work, &worker->scheduled,
- NULL);
+ /*
+ * ->scheduled list can only be filled while a worker is
+ * preparing to process a work or actually processing it.
+ * Make sure nobody diddled with it while I was sleeping.
+ */
+ BUG_ON(!list_empty(&worker->scheduled));
+
+ while (!list_empty(&cwq->worklist)) {
+ struct work_struct *work =
+ list_first_entry(&cwq->worklist,
+ struct work_struct, entry);
+
+ if (likely(!(*work_data_bits(work) & WORK_STRUCT_LINKED))) {
+ /* optimization path, not strictly necessary */
+ process_one_work(worker, work);
+ if (unlikely(!list_empty(&worker->scheduled)))
process_scheduled_works(worker);
- }
+ } else {
+ move_linked_works(work, &worker->scheduled, NULL);
+ process_scheduled_works(worker);
}
-
- spin_unlock_irq(&gcwq->lock);
}

- return 0;
+ /*
+ * gcwq->lock is held and there's no work to process, sleep.
+ * Workers are woken up only while holding gcwq->lock, so
+ * setting the current state before releasing gcwq->lock is
+ * enough to prevent losing any event.
+ */
+ worker_enter_idle(worker);
+ __set_current_state(TASK_INTERRUPTIBLE);
+ spin_unlock_irq(&gcwq->lock);
+ schedule();
+ goto woke_up;
}

struct wq_barrier {
@@ -1558,7 +1685,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
cwq->max_active = max_active;
INIT_LIST_HEAD(&cwq->worklist);
INIT_LIST_HEAD(&cwq->delayed_works);
- init_waitqueue_head(&cwq->more_work);

if (failed)
continue;
@@ -1609,7 +1735,7 @@ EXPORT_SYMBOL_GPL(__create_workqueue_key);
*/
void destroy_workqueue(struct workqueue_struct *wq)
{
- int cpu;
+ unsigned int cpu;

flush_workqueue(wq);

@@ -1626,8 +1752,10 @@ void destroy_workqueue(struct workqueue_struct *wq)
int i;

if (cwq->worker) {
+ spin_lock_irq(&cwq->gcwq->lock);
destroy_worker(cwq->worker);
cwq->worker = NULL;
+ spin_unlock_irq(&cwq->gcwq->lock);
}

for (i = 0; i < WORK_NR_COLORS; i++)
@@ -1842,7 +1970,7 @@ void thaw_workqueues(void)
cwq->nr_active < cwq->max_active)
cwq_activate_first_delayed(cwq);

- wake_up(&cwq->more_work);
+ wake_up_process(cwq->worker->task);
}

spin_unlock_irq(&gcwq->lock);
@@ -1857,6 +1985,7 @@ out_unlock:
void __init init_workqueues(void)
{
unsigned int cpu;
+ int i;

/*
* cwqs are forced aligned according to WORK_STRUCT_FLAG_BITS.
@@ -1876,6 +2005,10 @@ void __init init_workqueues(void)
spin_lock_init(&gcwq->lock);
gcwq->cpu = cpu;

+ INIT_LIST_HEAD(&gcwq->idle_list);
+ for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++)
+ INIT_HLIST_HEAD(&gcwq->busy_hash[i]);
+
ida_init(&gcwq->worker_ida);
}

--
1.6.4.2

Tejun Heo

unread,

Feb 26, 2010, 7:30:05 AM2/26/10

to

stop_machine() is the only user of RT workqueue. Reimplement it using
kthreads directly and rip RT support from workqueue. This is in
preparation of concurrency managed workqueue.

Signed-off-by: Tejun Heo <t...@kernel.org>
---

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index baba3a2..2d32e06 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -53,6 +53,11 @@ int stop_machine_create(void);
*/
void stop_machine_destroy(void);

+/**
+ * init_stop_machine: initialize stop_machine during boot
+ */
+void init_stop_machine(void);
+
#else

static inline int stop_machine(int (*fn)(void *), void *data,
@@ -67,6 +72,7 @@ static inline int stop_machine(int (*fn)(void *), void *data,

static inline int stop_machine_create(void) { return 0; }
static inline void stop_machine_destroy(void) { }
+static inline void init_stop_machine(void) { }

#endif /* CONFIG_SMP */
#endif /* _LINUX_STOP_MACHINE */
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 9466e86..0697946 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -181,12 +181,11 @@ static inline void destroy_work_on_stack(struct work_struct *work) { }

extern struct workqueue_struct *
-__create_workqueue_key(const char *name, int singlethread,
- int freezeable, int rt, struct lock_class_key *key,
- const char *lock_name);
+__create_workqueue_key(const char *name, int singlethread, int freezeable,
+ struct lock_class_key *key, const char *lock_name);

#ifdef CONFIG_LOCKDEP
-#define __create_workqueue(name, singlethread, freezeable, rt) \
+#define __create_workqueue(name, singlethread, freezeable) \
({ \
static struct lock_class_key __key; \
const char *__lock_name; \
@@ -197,19 +196,18 @@ __create_workqueue_key(const char *name, int singlethread,
__lock_name = #name; \
\
__create_workqueue_key((name), (singlethread), \
- (freezeable), (rt), &__key, \
+ (freezeable), &__key, \
__lock_name); \
})
#else
-#define __create_workqueue(name, singlethread, freezeable, rt) \
- __create_workqueue_key((name), (singlethread), (freezeable), (rt), \
+#define __create_workqueue(name, singlethread, freezeable) \
+ __create_workqueue_key((name), (singlethread), (freezeable), \
NULL, NULL)
#endif

-#define create_workqueue(name) __create_workqueue((name), 0, 0, 0)
-#define create_rt_workqueue(name) __create_workqueue((name), 0, 0, 1)
-#define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1, 0)
-#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0, 0)
+#define create_workqueue(name) __create_workqueue((name), 0, 0)
+#define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1)
+#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0)

extern void destroy_workqueue(struct workqueue_struct *wq);

diff --git a/init/main.c b/init/main.c
index 4cb47a1..8cf7543 100644
--- a/init/main.c
+++ b/init/main.c
@@ -34,6 +34,7 @@
#include <linux/security.h>
#include <linux/smp.h>
#include <linux/workqueue.h>
+#include <linux/stop_machine.h>
#include <linux/profile.h>
#include <linux/rcupdate.h>
#include <linux/moduleparam.h>
@@ -769,6 +770,7 @@ static void __init do_initcalls(void)
static void __init do_basic_setup(void)
{
init_workqueues();
+ init_stop_machine();
cpuset_init_smp();
usermodehelper_init();
init_tmpfs();
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 912823e..671a4ac 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -25,6 +25,8 @@ enum stopmachine_state {
STOPMACHINE_RUN,
/* Exit */
STOPMACHINE_EXIT,
+ /* Done */
+ STOPMACHINE_DONE,
};
static enum stopmachine_state state;

@@ -42,10 +44,9 @@ static DEFINE_MUTEX(lock);
static DEFINE_MUTEX(setup_lock);
/* Users of stop_machine. */
static int refcount;
-static struct workqueue_struct *stop_machine_wq;
+static struct task_struct **stop_machine_threads;
static struct stop_machine_data active, idle;
static const struct cpumask *active_cpus;
-static void *stop_machine_work;

static void set_state(enum stopmachine_state newstate)
{
@@ -63,14 +64,31 @@ static void ack_state(void)
}

/* This is the actual function which stops the CPU. It runs
- * in the context of a dedicated stopmachine workqueue. */
-static void stop_cpu(struct work_struct *unused)
+ * on dedicated per-cpu kthreads. */
+static int stop_cpu(void *unused)
{
enum stopmachine_state curstate = STOPMACHINE_NONE;
- struct stop_machine_data *smdata = &idle;
+ struct stop_machine_data *smdata;
int cpu = smp_processor_id();
int err;

+repeat:
+ /* Wait for __stop_machine() to initiate */
+ while (true) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ /* <- kthread_stop() and __stop_machine()::smp_wmb() */
+ if (kthread_should_stop()) {
+ __set_current_state(TASK_RUNNING);
+ return 0;
+ }
+ if (state == STOPMACHINE_PREPARE)
+ break;
+ schedule();
+ }
+ smp_rmb(); /* <- __stop_machine()::set_state() */
+
+ /* Okay, let's go */
+ smdata = &idle;
if (!active_cpus) {
if (cpu == cpumask_first(cpu_online_mask))
smdata = &active;
@@ -104,6 +122,7 @@ static void stop_cpu(struct work_struct *unused)
} while (curstate != STOPMACHINE_EXIT);

local_irq_enable();
+ goto repeat;
}

/* Callback for CPUs which aren't supposed to do anything. */
@@ -112,46 +131,122 @@ static int chill(void *unused)
return 0;
}

+static int create_stop_machine_thread(unsigned int cpu)
+{
+ struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+ struct task_struct **pp = per_cpu_ptr(stop_machine_threads, cpu);
+ struct task_struct *p;
+
+ if (*pp)
+ return -EBUSY;
+
+ p = kthread_create(stop_cpu, NULL, "kstop/%u", cpu);
+ if (IS_ERR(p))
+ return PTR_ERR(p);
+
+ sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
+ *pp = p;
+ return 0;
+}
+
+/* Should be called with cpu hotplug disabled and setup_lock held */
+static void kill_stop_machine_threads(void)
+{
+ unsigned int cpu;
+
+ if (!stop_machine_threads)
+ return;
+
+ for_each_online_cpu(cpu) {
+ struct task_struct *p = *per_cpu_ptr(stop_machine_threads, cpu);
+ if (p)
+ kthread_stop(p);
+ }
+ free_percpu(stop_machine_threads);
+ stop_machine_threads = NULL;
+}
+
int stop_machine_create(void)
{
+ unsigned int cpu;
+
+ get_online_cpus();
mutex_lock(&setup_lock);
if (refcount)
goto done;
- stop_machine_wq = create_rt_workqueue("kstop");
- if (!stop_machine_wq)
- goto err_out;
- stop_machine_work = alloc_percpu(struct work_struct);
- if (!stop_machine_work)
+
+ stop_machine_threads = alloc_percpu(struct task_struct *);
+ if (!stop_machine_threads)
goto err_out;
+
+ /*
+ * cpu hotplug is disabled, create only for online cpus,
+ * cpu_callback() will handle cpu hot [un]plugs.
+ */
+ for_each_online_cpu(cpu) {
+ if (create_stop_machine_thread(cpu))
+ goto err_out;
+ kthread_bind(*per_cpu_ptr(stop_machine_threads, cpu), cpu);
+ }
done:
refcount++;
mutex_unlock(&setup_lock);
+ put_online_cpus();
return 0;

err_out:
- if (stop_machine_wq)
- destroy_workqueue(stop_machine_wq);
+ kill_stop_machine_threads();
mutex_unlock(&setup_lock);
+ put_online_cpus();
return -ENOMEM;
}
EXPORT_SYMBOL_GPL(stop_machine_create);

void stop_machine_destroy(void)
{
+ get_online_cpus();
mutex_lock(&setup_lock);
- refcount--;
- if (refcount)
- goto done;
- destroy_workqueue(stop_machine_wq);
- free_percpu(stop_machine_work);
-done:
+ if (!--refcount)
+ kill_stop_machine_threads();
mutex_unlock(&setup_lock);
+ put_online_cpus();
}
EXPORT_SYMBOL_GPL(stop_machine_destroy);

+static int __cpuinit stop_machine_cpu_callback(struct notifier_block *nfb,
+ unsigned long action, void *hcpu)
+{
+ unsigned int cpu = (unsigned long)hcpu;
+ struct task_struct **pp = per_cpu_ptr(stop_machine_threads, cpu);
+
+ /* Hotplug exclusion is enough, no need to worry about setup_lock */
+ if (!stop_machine_threads)
+ return NOTIFY_OK;
+
+ switch (action & ~CPU_TASKS_FROZEN) {
+ case CPU_UP_PREPARE:
+ if (create_stop_machine_thread(cpu)) {
+ printk(KERN_ERR "failed to create stop machine "
+ "thread for %u\n", cpu);
+ return NOTIFY_BAD;
+ }
+ break;
+
+ case CPU_ONLINE:
+ kthread_bind(*pp, cpu);
+ break;
+
+ case CPU_UP_CANCELED:
+ case CPU_POST_DEAD:
+ kthread_stop(*pp);
+ *pp = NULL;
+ break;
+ }
+ return NOTIFY_OK;
+}
+
int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
{
- struct work_struct *sm_work;
int i, ret;

/* Set up initial state. */
@@ -164,19 +259,18 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
idle.fn = chill;
idle.data = NULL;

- set_state(STOPMACHINE_PREPARE);
+ set_state(STOPMACHINE_PREPARE); /* -> stop_cpu()::smp_rmb() */
+ smp_wmb(); /* -> stop_cpu()::set_current_state() */

/* Schedule the stop_cpu work on all cpus: hold this CPU so one
* doesn't hit this CPU until we're ready. */
get_cpu();
- for_each_online_cpu(i) {
- sm_work = per_cpu_ptr(stop_machine_work, i);
- INIT_WORK(sm_work, stop_cpu);
- queue_work_on(i, stop_machine_wq, sm_work);
- }
+ for_each_online_cpu(i)
+ wake_up_process(*per_cpu_ptr(stop_machine_threads, i));
/* This will release the thread on our CPU. */
put_cpu();
- flush_workqueue(stop_machine_wq);
+ while (state < STOPMACHINE_DONE)
+ yield();
ret = active.fnret;
mutex_unlock(&lock);
return ret;
@@ -197,3 +291,8 @@ int stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
return ret;
}
EXPORT_SYMBOL_GPL(stop_machine);
+
+void __init init_stop_machine(void)
+{
+ hotcpu_notifier(stop_machine_cpu_callback, 0);
+}
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f7a914f..115f30b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -62,7 +62,6 @@ struct workqueue_struct {
const char *name;
int singlethread;
int freezeable; /* Freeze threads during suspend */
- int rt;
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
@@ -923,7 +922,6 @@ init_cpu_workqueue(struct workqueue_struct *wq, int cpu)

static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
{
- struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
struct workqueue_struct *wq = cwq->wq;
const char *fmt = is_wq_single_threaded(wq) ? "%s" : "%s/%d";
struct task_struct *p;
@@ -939,8 +937,6 @@ static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
*/
if (IS_ERR(p))
return PTR_ERR(p);
- if (cwq->wq->rt)
- sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
cwq->thread = p;

trace_workqueue_creation(cwq->thread, cpu);
@@ -962,7 +958,6 @@ static void start_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)

struct workqueue_struct *__create_workqueue_key(const char *name,

int singlethread,
int freezeable,
- int rt,
struct lock_class_key *key,
const char *lock_name)
{
@@ -984,7 +979,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
wq->singlethread = singlethread;
wq->freezeable = freezeable;
- wq->rt = rt;
INIT_LIST_HEAD(&wq->list);

if (singlethread) {
--
1.6.4.2

Tejun Heo

unread,

Feb 26, 2010, 7:30:05 AM2/26/10

to

Add wakeup and sleep notifiers to sched_notifiers and allow omitting
some of the notifiers in the ops table. These will be used by
concurrency managed workqueue.

Signed-off-by: Tejun Heo <t...@kernel.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mike Galbraith <efa...@gmx.de>
Cc: Ingo Molnar <mi...@elte.hu>
---
include/linux/sched.h | 6 ++++++
kernel/sched.c | 11 ++++++++---
2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4a1e368..401d746 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1231,6 +1231,10 @@ struct sched_notifier;

/**
* sched_notifier_ops - notifiers called for scheduling events
+ * @wakeup: we're waking up
+ * notifier: struct sched_notifier for the task being woken up
+ * @sleep: we're going to bed
+ * notifier: struct sched_notifier for the task sleeping
* @in: we're about to be rescheduled:
* notifier: struct sched_notifier for the task being scheduled
* cpu: cpu we're scheduled on
@@ -1244,6 +1248,8 @@ struct sched_notifier;
* and depended upon by its users.
*/
struct sched_notifier_ops {
+ void (*wakeup)(struct sched_notifier *notifier);
+ void (*sleep)(struct sched_notifier *notifier);
void (*in)(struct sched_notifier *notifier, int cpu);
void (*out)(struct sched_notifier *notifier, struct task_struct *next);
};
diff --git a/kernel/sched.c b/kernel/sched.c
index 8c2dfb3..c371b8f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1439,7 +1439,8 @@ static inline void cpuacct_update_stats(struct task_struct *tsk,
struct hlist_node *__pos; \
\
hlist_for_each_entry(__sn, __pos, &(p)->sched_notifiers, link) \
- __sn->ops->callback(__sn , ##args); \
+ if (__sn->ops->callback) \
+ __sn->ops->callback(__sn , ##args); \
} while (0)

/**
@@ -2437,6 +2438,8 @@ static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
rq->idle_stamp = 0;
}
#endif
+ if (success)
+ fire_sched_notifiers(p, wakeup);
}

/**
@@ -5492,10 +5495,12 @@ need_resched_nonpreemptible:
clear_tsk_need_resched(prev);

if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
- if (unlikely(signal_pending_state(prev->state, prev)))
+ if (unlikely(signal_pending_state(prev->state, prev))) {
prev->state = TASK_RUNNING;
- else
+ } else {
+ fire_sched_notifiers(prev, sleep);
deactivate_task(rq, prev, 1);
+ }
switch_count = &prev->nvcsw;
}

--
1.6.4.2

Tejun Heo

unread,

Feb 26, 2010, 7:30:05 AM2/26/10

to

Use gcwq->worklist instead of cwq->worklist and break the strict
association between a cwq and its worker. All works queued on a cpu
are queued on gcwq->worklist and processed by any available worker on
the gcwq.

As there no longer is strict association between a cwq and its worker,
whether a work is executing can now only be determined by calling
[__]find_worker_executing_work().

After this change, the only association between a cwq and its worker
is that a cwq puts a worker into shared worker pool on creation and
kills it on destruction. As all workqueues are still limited to
max_active of one, this means that there are always at least as many
workers as active works and thus there's no danger for deadlock.

The break of strong association between cwqs and workers requires
somewhat clumsy changes to current_is_keventd() and
destroy_workqueue(). Dynamic worker pool management will remove both
clumsy changes. current_is_keventd() won't be necessary at all as the
only reason it exists is to avoid queueing a work from a work which
will be allowed just fine. The clumsy part of destroy_workqueue() is
added because a worker can only be destroyed while idle and there's no
guarantee a worker is idle when its wq is going down. With dynamic
pool management, workers are not associated with workqueues at all and
only idle ones will be submitted to destroy_workqueue() so the code
won't be necessary anymore.

Signed-off-by: Tejun Heo <t...@kernel.org>
---

kernel/workqueue.c | 130 +++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 98 insertions(+), 32 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c150a01..b0311b1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -72,7 +72,6 @@ enum {
*/

struct global_cwq;
-struct cpu_workqueue_struct;

struct worker {

/* on idle list while idle, on busy hash table while busy */

@@ -86,7 +85,6 @@ struct worker {

struct list_head scheduled; /* L: scheduled works */
struct task_struct *task; /* I: worker task */
struct global_cwq *gcwq; /* I: the associated gcwq */

- struct cpu_workqueue_struct *cwq; /* I: the associated cwq */

unsigned int flags; /* L: flags */
int id; /* I: worker id */
};

@@ -96,6 +94,7 @@ struct worker {
*/

struct global_cwq {
spinlock_t lock; /* the gcwq lock */

+ struct list_head worklist; /* L: list of pending works */

unsigned int cpu; /* I: the associated cpu */

unsigned int flags; /* L: GCWQ_* flags */

@@ -121,7 +120,6 @@ struct global_cwq {
*/

struct cpu_workqueue_struct {
struct global_cwq *gcwq; /* I: the associated gcwq */

- struct list_head worklist;

struct worker *worker;
struct workqueue_struct *wq; /* I: the owning workqueue */
int work_color; /* L: current color */

@@ -386,6 +384,32 @@ static struct global_cwq *get_work_gcwq(struct work_struct *work)
return get_gcwq(cpu);
}

+/* Return the first worker. Safe with preemption disabled */
+static struct worker *first_worker(struct global_cwq *gcwq)
+{
+ if (unlikely(list_empty(&gcwq->idle_list)))
+ return NULL;
+
+ return list_first_entry(&gcwq->idle_list, struct worker, entry);
+}
+
+/**
+ * wake_up_worker - wake up an idle worker
+ * @gcwq: gcwq to wake worker for
+ *
+ * Wake up the first idle worker of @gcwq.

+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).

+ */
+static void wake_up_worker(struct global_cwq *gcwq)
+{
+ struct worker *worker = first_worker(gcwq);
+
+ if (likely(worker))
+ wake_up_process(worker->task);
+}
+
/**

* busy_worker_head - return the busy hash head for a work

* @gcwq: gcwq of interest

@@ -467,13 +491,14 @@ static struct worker *find_worker_executing_work(struct global_cwq *gcwq,
}

/**
- * insert_work - insert a work into cwq
+ * insert_work - insert a work into gcwq

* @cwq: cwq @work belongs to
* @work: work to insert

* @head: insertion point
* @extra_flags: extra WORK_STRUCT_* flags to set
*
- * Insert @work into @cwq after @head.
+ * Insert @work which belongs to @cwq into @gcwq after @head.
+ * @extra_flags is or'd to work_struct flags.

*
* CONTEXT:
* spin_lock_irq(gcwq->lock).

@@ -492,7 +517,7 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
smp_wmb();

list_add_tail(&work->entry, head);
- wake_up_process(cwq->worker->task);
+ wake_up_worker(cwq->gcwq);
}

/**
@@ -608,7 +633,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,

if (likely(cwq->nr_active < cwq->max_active)) {
cwq->nr_active++;
- worklist = &cwq->worklist;
+ worklist = &gcwq->worklist;
} else
worklist = &cwq->delayed_works;

@@ -793,10 +818,10 @@ static struct worker *alloc_worker(void)

/**
* create_worker - create a new workqueue worker
- * @cwq: cwq the new worker will belong to
+ * @gcwq: gcwq the new worker will belong to
* @bind: whether to set affinity to @cpu or not
*
- * Create a new worker which is bound to @cwq. The returned worker
+ * Create a new worker which is bound to @gcwq. The returned worker
* can be started by calling start_worker() or destroyed using
* destroy_worker().
*
@@ -806,9 +831,8 @@ static struct worker *alloc_worker(void)
* RETURNS:
* Pointer to the newly created worker.
*/
-static struct worker *create_worker(struct cpu_workqueue_struct *cwq, bool bind)
+static struct worker *create_worker(struct global_cwq *gcwq, bool bind)
{
- struct global_cwq *gcwq = cwq->gcwq;
int id = -1;
struct worker *worker = NULL;

@@ -826,7 +850,6 @@ static struct worker *create_worker(struct cpu_workqueue_struct *cwq, bool bind)
goto fail;

worker->gcwq = gcwq;
- worker->cwq = cwq;
worker->id = id;

worker->task = kthread_create(worker_thread, worker, "kworker/%u:%d",
@@ -954,7 +977,7 @@ static void cwq_activate_first_delayed(struct cpu_workqueue_struct *cwq)
struct work_struct *work = list_first_entry(&cwq->delayed_works,
struct work_struct, entry);

- move_linked_works(work, &cwq->worklist, NULL);
+ move_linked_works(work, &cwq->gcwq->worklist, NULL);
cwq->nr_active++;
}

@@ -1022,11 +1045,12 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
*/

static void process_one_work(struct worker *worker, struct work_struct *work)
{

- struct cpu_workqueue_struct *cwq = worker->cwq;
+ struct cpu_workqueue_struct *cwq = get_work_cwq(work);

struct global_cwq *gcwq = cwq->gcwq;

struct hlist_head *bwh = busy_worker_head(gcwq, work);
work_func_t f = work->func;
int work_color;

+ struct worker *collision;
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct from
@@ -1037,6 +1061,18 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
*/
struct lockdep_map lockdep_map = work->lockdep_map;
#endif
+ /*
+ * A single work shouldn't be executed concurrently by
+ * multiple workers on a single cpu. Check whether anyone is
+ * already processing the work. If so, defer the work to the
+ * currently executing one.
+ */
+ collision = __find_worker_executing_work(gcwq, bwh, work);
+ if (unlikely(collision)) {
+ move_linked_works(work, &collision->scheduled, NULL);
+ return;
+ }
+

/* claim and process */
debug_work_deactivate(work);

hlist_add_head(&worker->hentry, bwh);
@@ -1044,7 +1080,6 @@ static void process_one_work(struct worker *worker, struct work_struct *work)

worker->current_cwq = cwq;
work_color = work_flags_to_color(*work_data_bits(work));

- BUG_ON(get_work_cwq(work) != cwq);
/* record the current cpu number in the work data and dequeue */
set_work_cpu(work, gcwq->cpu);
list_del_init(&work->entry);
@@ -1108,7 +1143,6 @@ static int worker_thread(void *__worker)

{
struct worker *worker = __worker;
struct global_cwq *gcwq = worker->gcwq;

- struct cpu_workqueue_struct *cwq = worker->cwq;

woke_up:
spin_lock_irq(&gcwq->lock);
@@ -1128,9 +1162,9 @@ woke_up:
*/
BUG_ON(!list_empty(&worker->scheduled));

- while (!list_empty(&cwq->worklist)) {
+ while (!list_empty(&gcwq->worklist)) {

struct work_struct *work =
- list_first_entry(&cwq->worklist,

+ list_first_entry(&gcwq->worklist,
struct work_struct, entry);

if (likely(!(*work_data_bits(work) & WORK_STRUCT_LINKED))) {
@@ -1800,18 +1834,37 @@ int keventd_up(void)

int current_is_keventd(void)
{
- struct cpu_workqueue_struct *cwq;
- int cpu = raw_smp_processor_id(); /* preempt-safe: keventd is per-cpu */
- int ret = 0;
+ bool found = false;
+ unsigned int cpu;

- BUG_ON(!keventd_wq);
+ /*
+ * There no longer is one-to-one relation between worker and
+ * work queue and a worker task might be unbound from its cpu
+ * if the cpu was offlined. Match all busy workers. This
+ * function will go away once dynamic pool is implemented.
+ */
+ for_each_possible_cpu(cpu) {
+ struct global_cwq *gcwq = get_gcwq(cpu);
+ struct worker *worker;
+ struct hlist_node *pos;
+ unsigned long flags;
+ int i;

- cwq = get_cwq(cpu, keventd_wq);
- if (current == cwq->worker->task)
- ret = 1;
+ spin_lock_irqsave(&gcwq->lock, flags);

- return ret;
+ for_each_busy_worker(worker, i, pos, gcwq) {
+ if (worker->task == current) {
+ found = true;
+ break;
+ }
+ }
+
+ spin_unlock_irqrestore(&gcwq->lock, flags);
+ if (found)
+ break;
+ }

+ return found;
}

static struct cpu_workqueue_struct *alloc_cwqs(void)
@@ -1900,12 +1953,11 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
cwq->wq = wq;
cwq->flush_color = -1;
cwq->max_active = max_active;
- INIT_LIST_HEAD(&cwq->worklist);
INIT_LIST_HEAD(&cwq->delayed_works);

if (failed)
continue;
- cwq->worker = create_worker(cwq, cpu_online(cpu));
+ cwq->worker = create_worker(gcwq, cpu_online(cpu));
if (cwq->worker)
start_worker(cwq->worker);
else
@@ -1965,13 +2017,26 @@ void destroy_workqueue(struct workqueue_struct *wq)

for_each_possible_cpu(cpu) {
struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+ struct global_cwq *gcwq = cwq->gcwq;

int i;

if (cwq->worker) {

- spin_lock_irq(&cwq->gcwq->lock);
+ retry:
+ spin_lock_irq(&gcwq->lock);
+ /*
+ * Worker can only be destroyed while idle.
+ * Wait till it becomes idle. This is ugly
+ * and prone to starvation. It will go away
+ * once dynamic worker pool is implemented.
+ */
+ if (!(cwq->worker->flags & WORKER_IDLE)) {
+ spin_unlock_irq(&gcwq->lock);
+ msleep(100);
+ goto retry;
+ }

destroy_worker(cwq->worker);
cwq->worker = NULL;

- spin_unlock_irq(&cwq->gcwq->lock);
+ spin_unlock_irq(&gcwq->lock);

}

for (i = 0; i < WORK_NR_COLORS; i++)

@@ -2290,7 +2355,7 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
*
* Start freezing workqueues. After this function returns, all
* freezeable workqueues will queue new works to their frozen_works
- * list instead of the cwq ones.
+ * list instead of gcwq->worklist.
*
* CONTEXT:
* Grabs and releases workqueue_lock and gcwq->lock's.
@@ -2376,7 +2441,7 @@ out_unlock:
* thaw_workqueues - thaw workqueues
*
* Thaw workqueues. Normal queueing is restored and all collected
- * frozen works are transferred to their respective cwq worklists.
+ * frozen works are transferred to their respective gcwq worklists.
*
* CONTEXT:
* Grabs and releases workqueue_lock and gcwq->lock's.
@@ -2457,6 +2522,7 @@ void __init init_workqueues(void)
struct global_cwq *gcwq = get_gcwq(cpu);

spin_lock_init(&gcwq->lock);
+ INIT_LIST_HEAD(&gcwq->worklist);
gcwq->cpu = cpu;

INIT_LIST_HEAD(&gcwq->idle_list);
--
1.6.4.2

Tejun Heo

unread,

Feb 26, 2010, 7:30:03 AM2/26/10

to

Make the following updates in preparation of concurrency managed
workqueue. None of these changes causes any visible behavior
difference.

* Add comments and adjust indentations to data structures and several
functions.

* Rename wq_per_cpu() to get_cwq() and swap the position of two
parameters for consistency. Convert a direct per_cpu_ptr() access
to wq->cpu_wq to get_cwq().

* Add work_static() and Update set_wq_data() such that it sets the
flags part to WORK_STRUCT_PENDING | WORK_STRUCT_STATIC if static |
@extra_flags.

* Move santiy check on work->entry emptiness from queue_work_on() to
__queue_work() which all queueing paths share.

* Make __queue_work() take @cpu and @wq instead of @cwq.

* Restructure flush_work() and __create_workqueue_key() to make them
easier to modify.

Signed-off-by: Tejun Heo <t...@kernel.org>
---

include/linux/workqueue.h | 5 ++
kernel/workqueue.c | 130 ++++++++++++++++++++++++++++----------------
2 files changed, 88 insertions(+), 47 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 0697946..e724daf 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -96,9 +96,14 @@ struct execute_work {
#ifdef CONFIG_DEBUG_OBJECTS_WORK

extern void __init_work(struct work_struct *work, int onstack);
extern void destroy_work_on_stack(struct work_struct *work);

+static inline unsigned int work_static(struct work_struct *work)
+{
+ return *work_data_bits(work) & (1 << WORK_STRUCT_STATIC);
+}

#else
static inline void __init_work(struct work_struct *work, int onstack) { }

static inline void destroy_work_on_stack(struct work_struct *work) { }

+static inline unsigned int work_static(struct work_struct *work) { return 0; }
#endif

/*
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 115f30b..8506c18 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -37,6 +37,16 @@
#include <trace/events/workqueue.h>

/*
+ * Structure fields follow one of the following exclusion rules.
+ *
+ * I: Set during initialization and read-only afterwards.
+ *
+ * L: cwq->lock protected. Access with cwq->lock held.
+ *
+ * W: workqueue_lock protected.
+ */
+
+/*
* The per-CPU workqueue (if single thread, we always use the first
* possible cpu).
*/
@@ -48,8 +58,8 @@ struct cpu_workqueue_struct {
wait_queue_head_t more_work;
struct work_struct *current_work;

- struct workqueue_struct *wq;
- struct task_struct *thread;
+ struct workqueue_struct *wq; /* I: the owning workqueue */
+ struct task_struct *thread;
} ____cacheline_aligned;

/*
@@ -57,13 +67,13 @@ struct cpu_workqueue_struct {
* per-CPU workqueues:
*/
struct workqueue_struct {
- struct cpu_workqueue_struct *cpu_wq;
- struct list_head list;
- const char *name;
+ struct cpu_workqueue_struct *cpu_wq; /* I: cwq's */
+ struct list_head list; /* W: list of all workqueues */
+ const char *name; /* I: workqueue name */

int singlethread;
int freezeable; /* Freeze threads during suspend */

#ifdef CONFIG_LOCKDEP
- struct lockdep_map lockdep_map;
+ struct lockdep_map lockdep_map;
#endif
};

@@ -204,8 +214,8 @@ static const struct cpumask *wq_cpu_map(struct workqueue_struct *wq)
? cpu_singlethread_map : cpu_populated_map;
}

-static
-struct cpu_workqueue_struct *wq_per_cpu(struct workqueue_struct *wq, int cpu)
+static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
+ struct workqueue_struct *wq)
{
if (unlikely(is_wq_single_threaded(wq)))
cpu = singlethread_cpu;
@@ -217,15 +227,13 @@ struct cpu_workqueue_struct *wq_per_cpu(struct workqueue_struct *wq, int cpu)
* - Must *only* be called if the pending flag is set
*/

static inline void set_wq_data(struct work_struct *work,

- struct cpu_workqueue_struct *cwq)
+ struct cpu_workqueue_struct *cwq,
+ unsigned long extra_flags)
{
- unsigned long new;
-
BUG_ON(!work_pending(work));

- new = (unsigned long) cwq | (1UL << WORK_STRUCT_PENDING);
- new |= WORK_STRUCT_FLAG_MASK & *work_data_bits(work);
- atomic_long_set(&work->data, new);
+ atomic_long_set(&work->data, (unsigned long)cwq | work_static(work) |
+ (1UL << WORK_STRUCT_PENDING) | extra_flags);
}

/*
@@ -233,9 +241,7 @@ static inline void set_wq_data(struct work_struct *work,
*/
static inline void clear_wq_data(struct work_struct *work)
{
- unsigned long flags = *work_data_bits(work) &
- (1UL << WORK_STRUCT_STATIC);
- atomic_long_set(&work->data, flags);
+ atomic_long_set(&work->data, work_static(work));
}

static inline
@@ -244,29 +250,47 @@ struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
return (void *) (atomic_long_read(&work->data) & WORK_STRUCT_WQ_DATA_MASK);
}

+/**
+ * insert_work - insert a work into cwq
+ * @cwq: cwq @work belongs to
+ * @work: work to insert
+ * @head: insertion point
+ * @extra_flags: extra WORK_STRUCT_* flags to set
+ *
+ * Insert @work into @cwq after @head.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */

static void insert_work(struct cpu_workqueue_struct *cwq,

- struct work_struct *work, struct list_head *head)
+ struct work_struct *work, struct list_head *head,
+ unsigned int extra_flags)
{
trace_workqueue_insertion(cwq->thread, work);

- set_wq_data(work, cwq);
+ /* we own @work, set data and link */
+ set_wq_data(work, cwq, extra_flags);
+
/*
* Ensure that we get the right work->data if we see the
* result of list_add() below, see try_to_grab_pending().
*/
smp_wmb();
+
list_add_tail(&work->entry, head);
wake_up(&cwq->more_work);
}

-static void __queue_work(struct cpu_workqueue_struct *cwq,
+static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
struct work_struct *work)
{
+ struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
unsigned long flags;

debug_work_activate(work);
spin_lock_irqsave(&cwq->lock, flags);
- insert_work(cwq, work, &cwq->worklist);
+ BUG_ON(!list_empty(&work->entry));
+ insert_work(cwq, work, &cwq->worklist, 0);
spin_unlock_irqrestore(&cwq->lock, flags);
}

@@ -308,8 +332,7 @@ queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
int ret = 0;

if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
- BUG_ON(!list_empty(&work->entry));
- __queue_work(wq_per_cpu(wq, cpu), work);
+ __queue_work(cpu, wq, work);
ret = 1;
}
return ret;
@@ -320,9 +343,8 @@ static void delayed_work_timer_fn(unsigned long __data)
{
struct delayed_work *dwork = (struct delayed_work *)__data;
struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work);
- struct workqueue_struct *wq = cwq->wq;

- __queue_work(wq_per_cpu(wq, smp_processor_id()), &dwork->work);
+ __queue_work(smp_processor_id(), cwq->wq, &dwork->work);
}

/**
@@ -366,7 +388,7 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
timer_stats_timer_set_start_info(&dwork->timer);

/* This stores cwq for the moment, for the timer_fn */
- set_wq_data(work, wq_per_cpu(wq, raw_smp_processor_id()));
+ set_wq_data(work, get_cwq(raw_smp_processor_id(), wq), 0);
timer->expires = jiffies + delay;
timer->data = (unsigned long)dwork;
timer->function = delayed_work_timer_fn;
@@ -430,6 +452,12 @@ static void run_workqueue(struct cpu_workqueue_struct *cwq)
spin_unlock_irq(&cwq->lock);
}

+/**
+ * worker_thread - the worker thread function
+ * @__cwq: cwq to serve
+ *
+ * The cwq worker thread function.
+ */
static int worker_thread(void *__cwq)
{
struct cpu_workqueue_struct *cwq = __cwq;
@@ -468,6 +496,17 @@ static void wq_barrier_func(struct work_struct *work)
complete(&barr->done);
}

+/**
+ * insert_wq_barrier - insert a barrier work
+ * @cwq: cwq to insert barrier into
+ * @barr: wq_barrier to insert
+ * @head: insertion point
+ *
+ * Insert barrier @barr into @cwq before @head.

+ *
+ * CONTEXT:

+ * spin_lock_irq(cwq->lock).
+ */

static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,

struct wq_barrier *barr, struct list_head *head)
{
@@ -479,11 +518,10 @@ static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
*/
INIT_WORK_ON_STACK(&barr->work, wq_barrier_func);
__set_bit(WORK_STRUCT_PENDING, work_data_bits(&barr->work));
-
init_completion(&barr->done);

debug_work_activate(&barr->work);
- insert_work(cwq, &barr->work, head);
+ insert_work(cwq, &barr->work, head, 0);
}

static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
@@ -517,9 +555,6 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
*
* We sleep until all works which were queued on entry have been handled,
* but we are not livelocked by new incoming ones.
- *
- * This function used to run the workqueues itself. Now we just wait for the
- * helper threads to do it.
*/
void flush_workqueue(struct workqueue_struct *wq)
{
@@ -558,7 +593,6 @@ int flush_work(struct work_struct *work)
lock_map_acquire(&cwq->wq->lockdep_map);
lock_map_release(&cwq->wq->lockdep_map);

- prev = NULL;
spin_lock_irq(&cwq->lock);
if (!list_empty(&work->entry)) {
/*
@@ -567,22 +601,22 @@ int flush_work(struct work_struct *work)
*/
smp_rmb();
if (unlikely(cwq != get_wq_data(work)))
- goto out;
+ goto already_gone;
prev = &work->entry;
} else {
if (cwq->current_work != work)
- goto out;
+ goto already_gone;
prev = &cwq->worklist;
}
insert_wq_barrier(cwq, &barr, prev->next);
-out:
- spin_unlock_irq(&cwq->lock);
- if (!prev)
- return 0;

+ spin_unlock_irq(&cwq->lock);
wait_for_completion(&barr.done);
destroy_work_on_stack(&barr.work);
return 1;
+already_gone:
+ spin_unlock_irq(&cwq->lock);
+ return 0;
}
EXPORT_SYMBOL_GPL(flush_work);

@@ -665,7 +699,7 @@ static void wait_on_work(struct work_struct *work)
cpu_map = wq_cpu_map(wq);

for_each_cpu(cpu, cpu_map)
- wait_on_cpu_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
+ wait_on_cpu_work(get_cwq(cpu, wq), work);
}

static int __cancel_work_timer(struct work_struct *work,
@@ -782,9 +816,7 @@ EXPORT_SYMBOL(schedule_delayed_work);
void flush_delayed_work(struct delayed_work *dwork)
{
if (del_timer_sync(&dwork->timer)) {
- struct cpu_workqueue_struct *cwq;
- cwq = wq_per_cpu(keventd_wq, get_cpu());
- __queue_work(cwq, &dwork->work);
+ __queue_work(get_cpu(), keventd_wq, &dwork->work);
put_cpu();
}
flush_work(&dwork->work);
@@ -967,13 +999,11 @@ struct workqueue_struct *__create_workqueue_key(const char *name,

wq = kzalloc(sizeof(*wq), GFP_KERNEL);
if (!wq)
- return NULL;
+ goto err;

wq->cpu_wq = alloc_percpu(struct cpu_workqueue_struct);
- if (!wq->cpu_wq) {
- kfree(wq);
- return NULL;
- }
+ if (!wq->cpu_wq)
+ goto err;

wq->name = name;

lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);

@@ -1017,6 +1047,12 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
wq = NULL;
}
return wq;
+err:
+ if (wq) {
+ free_percpu(wq->cpu_wq);
+ kfree(wq);
+ }
+ return NULL;
}
EXPORT_SYMBOL_GPL(__create_workqueue_key);

--
1.6.4.2

Anton Blanchard

unread,

Feb 27, 2010, 6:00:03 PM2/27/10

to

Fix the following compile error on PowerPC:

include/linux/workqueue.h:66: error: ‘NR_CPUS’ undeclared here (not in a function)

Signed-off-by: Anton Blanchard <an...@samba.org>
---

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 66573b8..5d1d9be 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -9,6 +9,7 @@
#include <linux/linkage.h>
#include <linux/bitops.h>
#include <linux/lockdep.h>
+#include <linux/threads.h>
#include <asm/atomic.h>

struct workqueue_struct;

Anton Blanchard

unread,

Feb 27, 2010, 8:10:01 PM2/27/10

to

Ensure ret is initialised to avoid the following warning:

kernel/workqueue.c: In function ‘work_busy’:
kernel/workqueue.c:2697: warning: ‘ret’ may be used uninitialized in this function

Signed-off-by: Anton Blanchard <an...@samba.org>
---

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5871708..6003afd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2682,7 +2682,7 @@ unsigned int work_busy(struct work_struct *work)
{
struct global_cwq *gcwq = get_work_gcwq(work);
unsigned long flags;
- unsigned int ret;
+ unsigned int ret = 0;

if (!gcwq)
return false;

Anton Blanchard

unread,

Feb 27, 2010, 8:20:01 PM2/27/10

to

Hi Tejun,

I gave the workqueue patches a spin on PowerPC. I'm particularly interested
from an OS jitter perspective, and that these patches wont introduce more
jitter. It looks like we reach a steady state of worker threads and arent
continually creating and destroying them which is good. This could be a big
deal on compute CPUs (CPUs isolated via isol_cpus or cpusets).

A few things I've found so far:

1. NR_CPUS > 32 causes issues with the workqueue debugfs code:

kernel/workqueue.c:3314: warning: left shift count >= width of type
kernel/workqueue.c:3323: warning: left shift count >= width of type
kernel/workqueue.c:3323: warning: integer overflow in expression
kernel/workqueue.c:3324: warning: enumeration values exceed range of largest integer
kernel/workqueue.c: In function ‘wq_debugfs_decode_pos’:
kernel/workqueue.c:3336: warning: right shift count is negative
kernel/workqueue.c:3337: warning: right shift count is negative
kernel/workqueue.c: In function ‘wq_debugfs_next_pos’:
kernel/workqueue.c:3435: warning: left shift count is negative
kernel/workqueue.c:3436: warning: left shift count is negative
kernel/workqueue.c: In function ‘wq_debugfs_start’:
kernel/workqueue.c:3455: warning: left shift count >= width of type

2. cifs needs to be converted:

fs/cifs/cifsfs.c: In function ‘exit_cifs’:
fs/cifs/cifsfs.c:1067: error: ‘system_single_wq’ undeclared (first use in this function)
fs/cifs/cifsfs.c:1067: error: (Each undeclared identifier is reported only once
fs/cifs/cifsfs.c:1067: error: for each function it appears in.)

Anton

Tejun Heo

unread,

Feb 28, 2010, 1:00:01 AM2/28/10

to

On 02/28/2010 07:52 AM, Anton Blanchard wrote:
> Fix the following compile error on PowerPC:
>
> include/linux/workqueue.h:66: error: ‘NR_CPUS’ undeclared here (not in a function)
>
> Signed-off-by: Anton Blanchard <an...@samba.org>

Fix folded into
0028-workqueue-carry-cpu-number-in-work-data-once-executi.patch

Thanks.

--
tejun

Tejun Heo

unread,

Feb 28, 2010, 1:10:02 AM2/28/10

to

Hello,

On 02/28/2010 10:00 AM, Anton Blanchard wrote:
> Ensure ret is initialised to avoid the following warning:
>
> kernel/workqueue.c: In function ‘work_busy’:
> kernel/workqueue.c:2697: warning: ‘ret’ may be used uninitialized in this function

Thanks a lot for catching this. gcc 4.4.1 building for x86_64 doesn't
seem to notice that. gcc has been relatively reliable in catching
these mistakes. I wonder what made it slip. My cross gcc 4.3.3 for
powerpc catches it. Strange. Anyways, fix folded into
0035-workqueue-implement-several-utility-APIs.patch

Thank you.

--
tejun

Tejun Heo

unread,

Feb 28, 2010, 1:30:02 AM2/28/10

to

Hello,

On 02/28/2010 10:11 AM, Anton Blanchard wrote:
> I gave the workqueue patches a spin on PowerPC. I'm particularly interested
> from an OS jitter perspective, and that these patches wont introduce more
> jitter. It looks like we reach a steady state of worker threads and arent
> continually creating and destroying them which is good. This could be a big
> deal on compute CPUs (CPUs isolated via isol_cpus or cpusets).

Yeap, it should reach a stable state very quickly.

> A few things I've found so far:
>
> 1. NR_CPUS > 32 causes issues with the workqueue debugfs code:

Heh heh, that's me using roundup_pow_of_two() where I should have used
order_base_2(). Fixed.

> 2. cifs needs to be converted:
>
> fs/cifs/cifsfs.c: In function ‘exit_cifs’:
> fs/cifs/cifsfs.c:1067: error: ‘system_single_wq’ undeclared (first use in this function)
> fs/cifs/cifsfs.c:1067: error: (Each undeclared identifier is reported only once
> fs/cifs/cifsfs.c:1067: error: for each function it appears in.)

Ah... right, fixed. Will soon update the git and patch tarball and
post the updated patches.

Thank you.

--
tejun

Oleg Nesterov

unread,

Feb 28, 2010, 9:20:01 AM2/28/10

to

On 02/26, Tejun Heo wrote:
>
> +static int stop_cpu(void *unused)
> {
> enum stopmachine_state curstate = STOPMACHINE_NONE;
> - struct stop_machine_data *smdata = &idle;
> + struct stop_machine_data *smdata;
> int cpu = smp_processor_id();
> int err;
>
> +repeat:
> + /* Wait for __stop_machine() to initiate */
> + while (true) {
> + set_current_state(TASK_INTERRUPTIBLE);
> + /* <- kthread_stop() and __stop_machine()::smp_wmb() */
> + if (kthread_should_stop()) {
> + __set_current_state(TASK_RUNNING);
> + return 0;
> + }
> + if (state == STOPMACHINE_PREPARE)
> + break;

Cosmetic nit: this doesn't matter at all, but perhaps it makes sense
to set TASK_RUNNING here too.

Actually, I was a bit confused by this "while (true)" loop. It looks
as if a spurious wakeup is possible. It is not, and more importantly,
if it was possible stop_machine_cpu_callback(CPU_POST_DEAD) (which is
called after cpu_hotplug_done()) could race with stop_machine().
stop_machine_cpu_callback(CPU_POST_DEAD) relies on fact that this thread
has already called schedule() and it can't be woken until kthread_stop()
sets ->should_stop.

> + schedule();
> + }
> + smp_rmb(); /* <- __stop_machine()::set_state() */
> +
> + /* Okay, let's go */
> + smdata = &idle;
> if (!active_cpus) {
> if (cpu == cpumask_first(cpu_online_mask))
> smdata = &active;

I never understood why do we need "struct stop_machine_data idle".
stop_cpu() just needs a "bool should_call_active_fn" ?

> int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
> {

> ...

> /* Schedule the stop_cpu work on all cpus: hold this CPU so one
> * doesn't hit this CPU until we're ready. */
> get_cpu();

> + for_each_online_cpu(i)
> + wake_up_process(*per_cpu_ptr(stop_machine_threads, i));

I think the comment is wrong, and we need preempt_disable() instead
of get_cpu(). We shouldn't worry about this CPU, but we need to ensure
the woken real-time thread can't preempt us until we wake up them all.

Oleg.

Oleg Nesterov

unread,

Feb 28, 2010, 9:40:01 AM2/28/10

to

On 02/26, Tejun Heo wrote:
>

> @@ -164,19 +259,18 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
> idle.fn = chill;
> idle.data = NULL;
>

> + smp_wmb(); /* -> stop_cpu()::set_current_state() */

> ...

> + for_each_online_cpu(i)
> + wake_up_process(*per_cpu_ptr(stop_machine_threads, i));

Afaics, this smp_wmb() is not needed, wake_up_process() (try_to_wake_up)
should ensure we can't race with set_current_state() + check_condition.
It does, note the wmb() in try_to_wake_up().

Oleg.

Tejun Heo

unread,

Mar 1, 2010, 10:10:01 AM3/1/10

to

Hello,

On 02/28/2010 11:11 PM, Oleg Nesterov wrote:
> On 02/26, Tejun Heo wrote:
>>
>> +static int stop_cpu(void *unused)
>> {
>> enum stopmachine_state curstate = STOPMACHINE_NONE;
>> - struct stop_machine_data *smdata = &idle;
>> + struct stop_machine_data *smdata;
>> int cpu = smp_processor_id();
>> int err;
>>
>> +repeat:
>> + /* Wait for __stop_machine() to initiate */
>> + while (true) {
>> + set_current_state(TASK_INTERRUPTIBLE);
>> + /* <- kthread_stop() and __stop_machine()::smp_wmb() */
>> + if (kthread_should_stop()) {
>> + __set_current_state(TASK_RUNNING);
>> + return 0;
>> + }
>> + if (state == STOPMACHINE_PREPARE)
>> + break;
>
> Cosmetic nit: this doesn't matter at all, but perhaps it makes sense
> to set TASK_RUNNING here too.

Yeap, I agree that would be prettier. Will do so.

> Actually, I was a bit confused by this "while (true)" loop. It looks
> as if a spurious wakeup is possible. It is not,

I don't think spurious wakeups are possible but without the loop the
PREPARE check should be done before schedule(), and, after the
schedule(), we'll need a matching BUG_ON() and the
kthread_should_stop() check with a comment explaining that the initial
exit condition check is done in the kthread code and thus not
necessary before the initial schedule(). It seems more complex and
fragile to me.

> and more importantly, if it was possible
> stop_machine_cpu_callback(CPU_POST_DEAD) (which is called after
> cpu_hotplug_done()) could race with stop_machine().
> stop_machine_cpu_callback(CPU_POST_DEAD) relies on fact that this
> thread has already called schedule() and it can't be woken until
> kthread_stop() sets ->should_stop.

Hmmm... I'm probably missing something but I don't see how
stop_machine_cpu_callback(CPU_POST_DEAD) depends on stop_cpu() thread
already parked in schedule(). Can you elaborate a bit?

>> + schedule();
>> + }
>> + smp_rmb(); /* <- __stop_machine()::set_state() */
>> +
>> + /* Okay, let's go */
>> + smdata = &idle;
>> if (!active_cpus) {
>> if (cpu == cpumask_first(cpu_online_mask))
>> smdata = &active;
>
> I never understood why do we need "struct stop_machine_data idle".
> stop_cpu() just needs a "bool should_call_active_fn" ?

Yeap, it's an odd way to switch to no-op. I have no idea why the
original code looked like that. Maybe it has some history. At any
rate, easy to fix. I'll write up a patch to change it.

>> int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
>> {
>> ...
>> /* Schedule the stop_cpu work on all cpus: hold this CPU so one
>> * doesn't hit this CPU until we're ready. */
>> get_cpu();
>> + for_each_online_cpu(i)
>> + wake_up_process(*per_cpu_ptr(stop_machine_threads, i));
>
> I think the comment is wrong, and we need preempt_disable() instead
> of get_cpu(). We shouldn't worry about this CPU, but we need to ensure
> the woken real-time thread can't preempt us until we wake up them all.

get_cpu() and preempt_disable() are exactly the same thing, aren't
they? Do you think get_cpu() is wrong there for some reason? The
comment could be right depending on how you interpret 'this CPU' -
ie. you could read it as 'hold on to the CPU which is waking up
stop_machine_threads'. But I suppose there's no harm in clarifying
the comment.

Thanks.

--
tejun

Tejun Heo

unread,

Mar 1, 2010, 10:20:03 AM3/1/10

to

Hello, again.

On 02/28/2010 11:34 PM, Oleg Nesterov wrote:
> On 02/26, Tejun Heo wrote:
>>
>> @@ -164,19 +259,18 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
>> idle.fn = chill;
>> idle.data = NULL;
>>
>> + smp_wmb(); /* -> stop_cpu()::set_current_state() */
>> ...
>> + for_each_online_cpu(i)
>> + wake_up_process(*per_cpu_ptr(stop_machine_threads, i));
>
> Afaics, this smp_wmb() is not needed, wake_up_process() (try_to_wake_up)
> should ensure we can't race with set_current_state() + check_condition.
> It does, note the wmb() in try_to_wake_up().

Yeap, the initial version was like that and it was awkward to explain
in the comment in stop_cpu() so I basically put it there as a
documentation anchor. Do you think removing it would be better?

Thanks.

--
tejun

Oleg Nesterov

unread,

Mar 1, 2010, 10:50:01 AM3/1/10

to

Hello,

On 03/02, Tejun Heo wrote:
>
> > and more importantly, if it was possible
> > stop_machine_cpu_callback(CPU_POST_DEAD) (which is called after
> > cpu_hotplug_done()) could race with stop_machine().
> > stop_machine_cpu_callback(CPU_POST_DEAD) relies on fact that this
> > thread has already called schedule() and it can't be woken until
> > kthread_stop() sets ->should_stop.
>
> Hmmm... I'm probably missing something but I don't see how
> stop_machine_cpu_callback(CPU_POST_DEAD) depends on stop_cpu() thread
> already parked in schedule(). Can you elaborate a bit?

Suppose that, when stop_machine_cpu_callback(CPU_POST_DEAD) is called,
that stop_cpu() thread T is still running and it is going to check state
before schedule().

CPU_POST_DEAD is called after cpu_hotplug_done(), another CPU can do
stop_machine() and set STOPMACHINE_PREPARE.

If T sees state == STOPMACHINE_PREPARE it will join the game, but it
wasn't counted in thread_ack counter, it is not cpu-bound, etc.

> >> int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
> >> {
> >> ...
> >> /* Schedule the stop_cpu work on all cpus: hold this CPU so one
> >> * doesn't hit this CPU until we're ready. */
> >> get_cpu();
> >> + for_each_online_cpu(i)
> >> + wake_up_process(*per_cpu_ptr(stop_machine_threads, i));
> >
> > I think the comment is wrong, and we need preempt_disable() instead
> > of get_cpu(). We shouldn't worry about this CPU, but we need to ensure
> > the woken real-time thread can't preempt us until we wake up them all.
>
> get_cpu() and preempt_disable() are exactly the same thing, aren't
> they?

Yes,

> Do you think get_cpu() is wrong there for some reason?

No. I think that the comment is confusing, and preempt_disable()
"looks" more correct.

In any case, this is very minor, please ignore. In fact, I mentioned
this only because this email was much longer initially, at first I
thought I noticed the bug, but I was wrong ;)

Oleg.

Oleg Nesterov

unread,

Mar 1, 2010, 10:50:01 AM3/1/10

to

On 03/02, Tejun Heo wrote:
>

> Hello, again.
>
> On 02/28/2010 11:34 PM, Oleg Nesterov wrote:
> > On 02/26, Tejun Heo wrote:
> >>
> >> @@ -164,19 +259,18 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
> >> idle.fn = chill;
> >> idle.data = NULL;
> >>
> >> + smp_wmb(); /* -> stop_cpu()::set_current_state() */
> >> ...
> >> + for_each_online_cpu(i)
> >> + wake_up_process(*per_cpu_ptr(stop_machine_threads, i));
> >
> > Afaics, this smp_wmb() is not needed, wake_up_process() (try_to_wake_up)
> > should ensure we can't race with set_current_state() + check_condition.
> > It does, note the wmb() in try_to_wake_up().
>
> Yeap, the initial version was like that and it was awkward to explain
> in the comment in stop_cpu() so I basically put it there as a
> documentation anchor.

OK,

> Do you think removing it would be better?

No, I just wanted to understand what I have missed. This applies to all
my questions in this thread ;)

Thanks,

Oleg.

Tejun Heo

unread,

Mar 1, 2010, 11:30:02 AM3/1/10

to

Hello,

On 03/02/2010 12:37 AM, Oleg Nesterov wrote:
>> Hmmm... I'm probably missing something but I don't see how
>> stop_machine_cpu_callback(CPU_POST_DEAD) depends on stop_cpu() thread
>> already parked in schedule(). Can you elaborate a bit?
>
> Suppose that, when stop_machine_cpu_callback(CPU_POST_DEAD) is called,
> that stop_cpu() thread T is still running and it is going to check state
> before schedule().
>
> CPU_POST_DEAD is called after cpu_hotplug_done(), another CPU can do
> stop_machine() and set STOPMACHINE_PREPARE.
>
> If T sees state == STOPMACHINE_PREPARE it will join the game, but it
> wasn't counted in thread_ack counter, it is not cpu-bound, etc.

Oh, I see. I was thinking get/put_online_cpus() block is exclusive
against cpu_maps_update_begin/done() instead of
cpu_hotplug_begin/done(). Will update and add comments.

Thanks.

--
tejun

Oleg Nesterov

unread,

Mar 1, 2010, 12:00:04 PM3/1/10

to

On 03/02, Tejun Heo wrote:
>

> Hello,
>
> On 03/02/2010 12:37 AM, Oleg Nesterov wrote:
> >
> > Suppose that, when stop_machine_cpu_callback(CPU_POST_DEAD) is called,
> > that stop_cpu() thread T is still running and it is going to check state
> > before schedule().
>

> Oh, I see. I was thinking get/put_online_cpus() block is exclusive
> against cpu_maps_update_begin/done() instead of
> cpu_hotplug_begin/done(). Will update and add comments.

Agreed, a little comment can help.

But, just in case, I forgot to repeat this case is not possible anyway.
_cpu_down() ensures idle_cpu(cpu) == T after __stop_machine(), this means
that this SCHED_FIFO thread we are going to kthread_stop() later can't
be active.

Oleg.

Tejun Heo

unread,

Mar 1, 2010, 1:00:03 PM3/1/10

to

Hello,

On 03/02/2010 01:50 AM, Oleg Nesterov wrote:
>> Oh, I see. I was thinking get/put_online_cpus() block is exclusive
>> against cpu_maps_update_begin/done() instead of
>> cpu_hotplug_begin/done(). Will update and add comments.
>
> Agreed, a little comment can help.
>
> But, just in case, I forgot to repeat this case is not possible anyway.
> _cpu_down() ensures idle_cpu(cpu) == T after __stop_machine(), this means
> that this SCHED_FIFO thread we are going to kthread_stop() later can't
> be active.

Yeah, sure, that was what I was gonna write as comment. :-)

Thanks for reviewing and see you tomorrow. It's getting pretty late
(or early) here. Bye.

--
tejun

David Howells

unread,

Mar 10, 2010, 10:00:01 AM3/10/10

to

* queue_work - queue work on a workqueue
* @wq: workqueue to use
* @work: work to queue
*
* Returns 0 if @work was already on a queue, non-zero otherwise.
*
* We queue the work to the CPU on which it was submitted, but if the CPU dies
* it can be processed by another CPU.

So when a work item is running on a CPU, any work items it queues (including
requeueing itself) will be queued upon that CPU for attention only by that CPU
(assuming that CPU doesn't get pulled out)?

David

Tejun Heo

unread,

Mar 12, 2010, 12:10:02 AM3/12/10

to

Hello,

On 03/10/2010 11:52 PM, David Howells wrote:
> * queue_work - queue work on a workqueue
> * @wq: workqueue to use
> * @work: work to queue
> *
> * Returns 0 if @work was already on a queue, non-zero otherwise.
> *
> * We queue the work to the CPU on which it was submitted, but if the CPU dies
> * it can be processed by another CPU.
>
> So when a work item is running on a CPU, any work items it queues (including
> requeueing itself) will be queued upon that CPU for attention only by that CPU
> (assuming that CPU doesn't get pulled out)?

Yes, that's the one of the characteristics of workqueue. For IO bound
stuff, it usually is a plus.

Thanks.

--
tejun

David Howells

unread,

Mar 12, 2010, 6:30:02 AM3/12/10

to

Tejun Heo <t...@kernel.org> wrote:

> > So when a work item is running on a CPU, any work items it queues
> > (including requeueing itself) will be queued upon that CPU for attention
> > only by that CPU (assuming that CPU doesn't get pulled out)?
>
> Yes, that's the one of the characteristics of workqueue. For IO bound
> stuff, it usually is a plus.

Hmmm. So when, say, an FS-Cache index object finishes creating itself on disk
and then releases all the several thousand data objects waiting on that event
so that they can then create themselves, it will do this by calling
queue_work() on each of them in turn. I take it this will stack them all up on
that one CPU's work queue.

David

Tejun Heo

unread,

Mar 12, 2010, 6:00:02 PM3/12/10

to

Hello,

On 03/12/2010 08:23 PM, David Howells wrote:
>> Yes, that's the one of the characteristics of workqueue. For IO bound
>> stuff, it usually is a plus.
>
> Hmmm. So when, say, an FS-Cache index object finishes creating
> itself on disk and then releases all the several thousand data
> objects waiting on that event so that they can then create
> themselves, it will do this by calling queue_work() on each of them
> in turn. I take it this will stack them all up on that one CPU's
> work queue.

Well, you can RR queue them but in general I don't think things like
that would be much of a problem for IO bounded works. If it becomes
bad, scheduler will end up moving the source around and for most
common cases, those group queued works are gonna hit similar code
paths over and over again during their short CPU burn durations so
it's likely to be more efficient. Are you seeing maleffects of cpu
affine work scheduling during fscache load tests?

Thank you for testing.

--
tejun

David Howells

unread,

Mar 16, 2010, 10:50:02 AM3/16/10

to

Tejun Heo <t...@kernel.org> wrote:

> Well, you can RR queue them but in general I don't think things like
> that would be much of a problem for IO bounded works.

"RR queue"? Do you mean realtime?

> If it becomes bad, scheduler will end up moving the source around

"The source"? Do you mean the process that's loading the deferred work items
onto the workqueue? Why should it get moved? Isn't it pinned to a CPU?

> and for most common cases, those group queued works are gonna hit similar
> code paths over and over again during their short CPU burn durations so it's
> likely to be more efficient.

True.

> Are you seeing maleffects of cpu affine work scheduling during fscache load
> tests?

Hard to say. Hear are some benchmarks:

(*) SLOW-WORK, cold server, cold cache:

real 2m0.974s
user 0m0.492s
sys 0m15.593s

(*) SLOW-WORK, hot server, cold cache:

real 1m31.230s 1m13.408s
user 0m0.612s 0m0.652s
sys 0m17.845s 0m15.641s

(*) SLOW-WORK, hot server, warm cache:

real 3m22.108s 3m52.557s
user 0m0.636s 0m0.588s
sys 0m13.317s 0m16.101s

(*) SLOW-WORK, hot server, hot cache:

real 1m54.331s 2m2.745s
user 0m0.596s 0m0.608s
sys 0m11.457s 0m12.625s

(*) SLOW-WORK, hot server, no cache:

real 1m1.508s 0m54.973s
user 0m0.568s 0m0.712s
sys 0m15.457s 0m13.969s

(*) CMWQ, cold-ish server, cold cache:

real 1m5.154s
user 0m0.628s
sys 0m14.397s

(*) CMWQ, hot server, cold cache:

real 1m1.240s 1m4.012s
user 0m0.732s 0m0.576s
sys 0m13.053s 0m14.133s

(*) CMWQ, hot server, warm cache:

real 3m10.949s 4m9.805s
user 0m0.636s 0m0.648s
sys 0m14.065s 0m13.505s

(*) CMWQ, hot server, hot cache:

real 1m22.511s 2m57.075s
user 0m0.612s 0m0.604s
sys 0m11.629s 0m12.509s

Note that it took me several goes to get a second result for this case:
it kept failing in a way that suggested that the non-reentrancy stuff you
put in there failed somehow, but it's difficult to say for sure.

David

Tejun Heo

unread,

Mar 16, 2010, 12:10:02 PM3/16/10

to

Hello,

On 03/16/2010 11:38 PM, David Howells wrote:
>> Well, you can RR queue them but in general I don't think things like
>> that would be much of a problem for IO bounded works.
>
> "RR queue"? Do you mean realtime?

I meant round-robin as the last resort but if fscache really needs
such workaround, cmwq is probably a bad fit for it.

>> If it becomes bad, scheduler will end up moving the source around
>
> "The source"? Do you mean the process that's loading the deferred
> work items onto the workqueue? Why should it get moved? Isn't it
> pinned to a CPU?

Whatever the source may be. If a cpu gets loaded heavily from fscache
workload, things which aren't pinned to the cpu will be distributed to
other cpus. But again, I have difficult time imagining cpu loading
being an actual issue for fscache even in pathological cases. It's
almost strictly IO bound and CPU intensive stuff sitting in the IO
path already have or should grow mechanisms to schedule those properly
anyway.

>> and for most common cases, those group queued works are gonna hit similar
>> code paths over and over again during their short CPU burn durations so it's
>> likely to be more efficient.
>
> True.
>
>> Are you seeing maleffects of cpu affine work scheduling during
>> fscache load tests?
>
> Hard to say. Hear are some benchmarks:

Yay, some numbers. :-) I reorganized them for easier comparison.

(*) cold/cold-ish server, cold cache:

SLOW-WORK CMWQ
real 2m0.974s 1m5.154s
user 0m0.492s 0m0.628s
sys 0m15.593s 0m14.397s

(*) hot server, cold cache:

SLOW-WORK CMWQ
real 1m31.230s 1m13.408s 1m1.240s 1m4.012s
user 0m0.612s 0m0.652s 0m0.732s 0m0.576s
sys 0m17.845s 0m15.641s 0m13.053s 0m14.133s

(*) hot server, warm cache:

SLOW-WORK CMWQ
real 3m22.108s 3m52.557s 3m10.949s 4m9.805s
user 0m0.636s 0m0.588s 0m0.636s 0m0.648s
sys 0m13.317s 0m16.101s 0m14.065s 0m13.505s

(*) hot server, hot cache:

SLOW-WORK CMWQ
real 1m54.331s 2m2.745s 1m22.511s 2m57.075s
user 0m0.596s 0m0.608s 0m0.612s 0m0.604s
sys 0m11.457s 0m12.625s 0m11.629s 0m12.509s

(*) hot server, no cache:

SLOW-WORK CMWQ

real 1m1.508s 0m54.973s
user 0m0.568s 0m0.712s
sys 0m15.457s 0m13.969s

> Note that it took me several goes to get a second result for this

> case: it kept failing in a way that suggested that the
> non-reentrancy stuff you put in there failed somehow, but it's
> difficult to say for sure.

Sure, there could be a bug in the non-reentrance implementation but
I'm leaning more towards a bug in work flushing before freeing thing
which also seems to show up in the debugfs path. I'll try to
reproduce the problem here and debug it.

That said, the numbers look generally favorable to CMWQ although the
sample size is too small to draw conclusions. I'll try to get things
fixed up so that testing can be smoother.

Thanks a lot for testing.

--
tejun

David Howells

unread,

Mar 16, 2010, 1:30:02 PM3/16/10

to

Tejun Heo <t...@kernel.org> wrote:

> Sure, there could be a bug in the non-reentrance implementation but
> I'm leaning more towards a bug in work flushing before freeing thing
> which also seems to show up in the debugfs path. I'll try to
> reproduce the problem here and debug it.

I haven't managed to reproduce it since I reported it:-/

> That said, the numbers look generally favorable to CMWQ although the
> sample size is too small to draw conclusions. I'll try to get things
> fixed up so that testing can be smoother.

You have to take the numbers with a large pinch of salt, I think, in both
cases. Pulling over the otherwise unladen GigE network from the server with
the data in RAM is somewhat faster than sucking from disk. Furthermore, since
the test is massively parallel, with each thread reading separate data, the
result is going to be very much dependent on what order the reads happen to be
issued this time compared to the order they were issued when the cache was
filled.

I need to fix my slow-test server that's dangling at the end of an ethernet
over mains connection. That gives much more consistent results as the disk
speed is greater than the network connection speed.

Looking at the numbers, I think CMWQ may appear to give better results in the
cold-cache case by starting off confining many accesses to the cache to a
single CPU, given that cache object creation and data storage is done
asynchronously in the background. This is due to object creation getting
deferred until index creation is achieved (several loopups, mkdirs and
setxattrs), and then all dumped at once onto the CPU that handled the index
creation, as we discussed elsewhere.

The program I'm using to read the data doesn't give any real penalty when its
threads can't actually run in parallel, so it probably doesn't mind being
largely confined to the other CPU. But that's benchmarking for you...

You should probably also disregard the coldish-server numbers. I'm not sure
my desktop machine (which was acting as the server) was purged of the dataset.
I'd need to reboot the server to be sure, but that's inconvenient of my
desktop.

But, at a glance, the numbers don't appear to be too different. There are
cases where CMWQ definitely appears better, and some where it definitely
appears worse, but the spread is so huge, that could just be noise.

David

Tejun Heo

unread,

Apr 25, 2010, 4:20:01 AM4/25/10

to

Hello,

I've just updated the git tree.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-cmwq

The original take#4 is now in branch review-cmwq-3. This will
probably become take#5 soonish but I don't have access to my test
machines for some days, so this is sort of take#5 pre-release.
Changes are...

* The patchset is rebased on cpu_stop + sched/core. cpu_stop already
reimplements stop_machine so that it doesn't use RT workqueue, so
this patchset simply drops RT wq support.

* Oleg's clear work->data patch moved at the head of the queue and now
lives in the for-next branch which will be pushed to mainline on the
next merge window.

* David reported a bug where fscache show_work function caused panic
by accessing already dead object. This turned out to be race
condition between put() after execution and show(). If the put()
after work execution is the last put, the object gets destroyed;
however, show() could be called to describe the work is actually
finished ending up dereferencing already freed object. Fixed by
deferring put() to another work if debugfs support is enabled so
that the object stays alive while the work is executing. Due to
lack of test setup, I couldn't actually test this yet. I'll verify
it works as soon as I have access to my stuff.

* Applied Oleg's review.

* Comments updated as suggested.

* work_flags_to_color() replaced w/ get_work_color()

* nr_cwqs_to_flush bug which could cause premature flush completion
fixed.

* Replace rewind + list_for_each_entry_safe_continue() w/
list_for_each_entry_safe_from().

* Don't directly write to *work_data_bits() but use __set_bit()
instead.

* Fixed cpu hotplug exclusion bug.

Thanks.

--
tejun