[CPUISOL] CPU isolation extensions

ma...@qualcomm.com

Jan 27, 2008, 11:34:56 PM
to linux-...@vger.kernel.org

The following patch series extends CPU isolation support. Yes, most people want to virtualize
CPUs these days and I want to isolate them :).
The primary idea here is to be able to use some CPU cores as dedicated engines for running
user-space code with minimal kernel overhead/intervention; think of it as an SPE in the
Cell processor.

We've had scheduler support for CPU isolation ever since the O(1) scheduler went in.
I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
In fact that's the primary distinction that I'm making between, say, "CPU sets" and
"CPU isolation": "CPU sets" let you manage user-space load while "CPU isolation" provides
a way to isolate a CPU as much as possible (including kernel activities).

I'm personally using this for hard realtime purposes. With CPU isolation it's very easy to
achieve single-digit usec worst-case and around 200 nsec average response times on off-the-shelf
multi-processor/core systems under extreme system load. I'm working with legal folks on releasing
the hard-RT user-space framework for that.
I can also see other applications, like simulators and such, that can benefit from this.

I've been maintaining this stuff since around 2.6.18 and it's been running in production
environment for a couple of years now. It's been tested on all kinds of machines, from NUMA
boxes like HP xw9300/9400 to tiny uTCA boards like Mercury AXA110.
The messiest part used to be SLAB garbage collector changes. With the new SLUB all that mess
goes away (ie no changes necessary). Also CFS seems to handle CPU hotplug much better than O(1)
did (ie domains are recomputed dynamically) so that isolation can be done at any time (via sysfs).
So this seems like a good time to merge.

Anyway, the patchset consists of 5 patches. The first three are very simple and non-controversial.
They simply make "CPU isolation" a configurable feature, export cpu_isolated_map and provide
some helper functions to access it (just like cpu_online() and friends).
The last two patches add support for isolating CPUs from running workqueues and the stop machine.
More details are in the individual patch descriptions.
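
For a concrete picture, usage would look roughly like this (an illustrative sketch, assuming this
patch set is applied; it uses the isolcpus= boot option and the sysfs attribute discussed later
in this thread):

  # boot-time isolation: append to the kernel command line, e.g.
  #   isolcpus=1
  # run-time isolation: the CPU has to be offline while the bit is flipped
  echo 0 > /sys/devices/system/cpu/cpu1/online
  echo 1 > /sys/devices/system/cpu/cpu1/isolated
  echo 1 > /sys/devices/system/cpu/cpu1/online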

Ideally I'd like all of this to go in during this merge window. If people think it's acceptable,
Linus or Andrew (or whoever is more appropriate, Ingo maybe) can pull this patch set from
git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git

That tree is rebased against the latest (as of yesterday) Linus tree.

Thanx
Max

 arch/x86/Kconfig                  |    1
 arch/x86/kernel/genapic_flat_64.c |    5 ++--
 drivers/base/cpu.c                |   47 ++++++++++++++++++++++++++++++++++++++
 include/linux/cpumask.h           |    3 ++
 kernel/Kconfig.cpuisol            |   25 +++++++++++++++++++-
 kernel/sched.c                    |   13 ++++++----
 kernel/stop_machine.c             |    3 --
 kernel/workqueue.c                |   31 ++++++++++++++++++-------
 8 files changed, 110 insertions(+), 18 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Peter Zijlstra

Jan 28, 2008, 4:14:15 AM
to ma...@qualcomm.com, linux-...@vger.kernel.org, Ingo Molnar, Steven Rostedt, Gregory Haskins, Paul Jackson
[ You really ought to CC people :-) ]

On Sun, 2008-01-27 at 20:09 -0800, ma...@qualcomm.com wrote:
> Following patch series extends CPU isolation support. Yes, most people want to virtuallize
> CPUs these days and I want to isolate them :).
> The primary idea here is to be able to use some CPU cores as dedicated engines for running
> user-space code with minimal kernel overhead/intervention, think of it as an SPE in the
> Cell processor.
>
> We've had scheduler support for CPU isolation ever since O(1) scheduler went it.
> I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
> In fact that the primary distinction that I'm making between say "CPU sets" and
> "CPU isolation". "CPU sets" let you manage user-space load while "CPU isolation" provides
> a way to isolate a CPU as much as possible (including kernel activities).

Ok, so you're aware of CPU sets, miss a feature, but instead of
extending it to cover your needs you build something new entirely?

> I'm personally using this for hard realtime purposes. With CPU isolation it's very easy to
> achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf
> multi- processor/core systems under exteme system load. I'm working with legal folks on releasing
> hard RT user-space framework for that.
> I can also see other application like simulators and stuff that can benefit from this.

have you been using just this, or in combination with the -rt effort?

Paul Jackson

Jan 28, 2008, 9:59:57 AM
to Peter Zijlstra, ma...@qualcomm.com, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com
Thanks for the CC, Peter.

Ingo - see question at end of message.

Max wrote:
> We've had scheduler support for CPU isolation ever since O(1) scheduler went it.
> I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.

I recently added the per-cpuset flag 'sched_load_balance' for some
other realtime folks, so that they can disable the kernel scheduler
load balancing on isolated CPUs. It essentially allows for dynamic
control of which CPUs are isolated by the scheduler, using the cpuset
hierarchy, rather than enhancing the 'isolated_cpus' mask. That
'isolated_cpus' mask remained a minimal kernel boottime parameter.
I believe this went to Linus's tree about Oct 2007.
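
For illustration, with the cpuset filesystem that looks roughly like this (a sketch assuming the
2.6.24-era cpuset interface; exact file names may differ):

  mkdir -p /dev/cpuset
  mount -t cpuset none /dev/cpuset
  echo 0 > /dev/cpuset/sched_load_balance    # stop balancing across the whole machine
  mkdir /dev/cpuset/rt
  echo 3 > /dev/cpuset/rt/cpus               # CPU(s) to keep to ourselves
  echo 0 > /dev/cpuset/rt/mems
  echo 0 > /dev/cpuset/rt/sched_load_balance # no balancing inside this cpuset either
  echo $$ > /dev/cpuset/rt/tasks             # move the current shell (and children) in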

It looks like you have three additional tweaks for realtime in this
patch set, with your patches:

[PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot
[PATCH] [CPUISOL] Support for workqueue isolation
[PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"

It would be interesting to see a patchset with the above three realtime
tweaks, layered on this new cpuset 'sched_load_balance' apparatus, rather
than layered on changes to make 'isolated_cpus' more dynamic. Some of us
run realtime and cpuset-intensive loads on the same system, so like to
have those two capabilities co-operate with each other.

Ingo - what's your sense of the value of the above three realtime tweaks
(the last three patches in Max's patch set)?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@sgi.com> 1.940.382.4214

Steven Rostedt

Jan 28, 2008, 11:36:36 AM
to Paul Jackson, Peter Zijlstra, ma...@qualcomm.com, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com
On Mon, Jan 28, 2008 at 08:59:10AM -0600, Paul Jackson wrote:
> Thanks for the CC, Peter.

Thanks from me too.

> Max wrote:
> > We've had scheduler support for CPU isolation ever since O(1) scheduler went it.
> > I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
>
> I recently added the per-cpuset flag 'sched_load_balance' for some
> other realtime folks, so that they can disable the kernel scheduler
> load balancing on isolated CPUs. It essentially allows for dynamic
> control of which CPUs are isolated by the scheduler, using the cpuset
> hierarchy, rather than enhancing the 'isolated_cpus' mask. That
> 'isolated_cpus' mask remained a minimal kernel boottime parameter.
> I believe this went to Linus's tree about Oct 2007.
>
> It looks like you have three additional tweaks for realtime in this
> patch set, with your patches:
>
> [PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot

I didn't know we still routed IRQs to isolated CPUs. I guess I need to
look deeper into the code on this one. But I agree that isolated CPUs
should not have IRQs routed to them.
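
Whether IRQs are actually landing on a given CPU is easy to check from user space (the IRQ
number below is just an example):

  cat /proc/interrupts            # per-CPU interrupt counts; an isolated CPU's column should stay quiet
  cat /proc/irq/19/smp_affinity   # hex bitmap of CPUs allowed to service IRQ 19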

> [PATCH] [CPUISOL] Support for workqueue isolation

The thing about workqueues is that they should only be woken on a CPU if
something on that CPU accessed them. IOW, the workqueue on a CPU handles
work that was queued by something on that CPU. Which means that
something the high-prio task did triggered a workqueue to do some work.
But this can also be triggered by interrupts, so by keeping interrupts
off the CPU no workqueue should be activated.

> [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"

This I find very dangerous. We are making an assumption that tasks on an
isolated CPU won't be doing things that stop_machine requires. What stops
a task on an isolated CPU from calling something in the kernel that
stop_machine requires to halt?

-- Steve


>
> It would be interesting to see a patchset with the above three realtime
> tweaks, layered on this new cpuset 'sched_load_balance' apparatus, rather
> than layered on changes to make 'isolated_cpus' more dynamic. Some of us
> run realtime and cpuset-intensive loads on the same system, so like to
> have those two capabilities co-operate with each other.
>
> Ingo - what's your sense of the value of the above three realtime tweaks
> (the last three patches in Max's patch set)?
>
--

Peter Zijlstra

Jan 28, 2008, 11:45:41 AM
to Steven Rostedt, Paul Jackson, ma...@qualcomm.com, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com

On Mon, 2008-01-28 at 11:34 -0500, Steven Rostedt wrote:
> On Mon, Jan 28, 2008 at 08:59:10AM -0600, Paul Jackson wrote:
> > Thanks for the CC, Peter.
>
> Thanks from me too.
>
> > Max wrote:
> > > We've had scheduler support for CPU isolation ever since O(1) scheduler went it.
> > > I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
> >
> > I recently added the per-cpuset flag 'sched_load_balance' for some
> > other realtime folks, so that they can disable the kernel scheduler
> > load balancing on isolated CPUs. It essentially allows for dynamic
> > control of which CPUs are isolated by the scheduler, using the cpuset
> > hierarchy, rather than enhancing the 'isolated_cpus' mask. That
> > 'isolated_cpus' mask remained a minimal kernel boottime parameter.
> > I believe this went to Linus's tree about Oct 2007.
> >
> > It looks like you have three additional tweaks for realtime in this
> > patch set, with your patches:
> >
> > [PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot
>
> I didn't know we still routed IRQs to isolated CPUs. I guess I need to
> look deeper into the code on this one. But I agree that isolated CPUs
> should not have IRQs routed to them.

While I agree with this in principle, I'm not sure flat out denying all
IRQs to these cpus is a good option. What about the case where we want
to service just this one specific IRQ on this CPU and no others?

Can't this be done by userspace irq routing as used by irqbalanced?

> > [PATCH] [CPUISOL] Support for workqueue isolation
>
> The thing about workqueues is that they should only be woken on a CPU if
> something on that CPU accessed them. IOW, the workqueue on a CPU handles
> work that was called by something on that CPU. Which means that
> something that high prio task did triggered a workqueue to do some work.
> But this can also be triggered by interrupts, so by keeping interrupts
> off the CPU no workqueue should be activated.

Quite so; if nobody uses them, there is no harm in having them around. If
they are used, it's by someone already allowed on the cpu.

> > [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"
>
> This I find very dangerous. We are making an assumption that tasks on an
> isolated CPU wont be doing things that stopmachine requires. What stops
> a task on an isolated CPU from calling something into the kernel that
> stop_machine requires to halt?

Very dangerous indeed!

Max Krasnyanskiy

Jan 28, 2008, 1:38:34 PM
to Paul Jackson, Peter Zijlstra, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com
Paul Jackson wrote:
> Thanks for the CC, Peter.
>
> Ingo - see question at end of message.
>
> Max wrote:
>> We've had scheduler support for CPU isolation ever since O(1) scheduler went it.
>> I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
>
> I recently added the per-cpuset flag 'sched_load_balance' for some
> other realtime folks, so that they can disable the kernel scheduler
> load balancing on isolated CPUs. It essentially allows for dynamic
> control of which CPUs are isolated by the scheduler, using the cpuset
> hierarchy, rather than enhancing the 'isolated_cpus' mask. That
> 'isolated_cpus' mask remained a minimal kernel boottime parameter.
> I believe this went to Linus's tree about Oct 2007.
>
> It looks like you have three additional tweaks for realtime in this
> patch set, with your patches:
>
> [PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot
> [PATCH] [CPUISOL] Support for workqueue isolation
> [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"
>
> It would be interesting to see a patchset with the above three realtime
> tweaks, layered on this new cpuset 'sched_load_balance' apparatus, rather
> than layered on changes to make 'isolated_cpus' more dynamic. Some of us
> run realtime and cpuset-intensive loads on the same system, so like to
> have those two capabilities co-operate with each other.
I'll definitely take a look. So far it seems that extending cpu_isolated_map
is a more natural way of propagating this notion to the rest of the kernel,
since it's very similar to the cpu_online_map concept and it's easy to integrate
with the code that already uses it.
Anyway, I'll take a look at the cpuset flag that you mentioned and report back.

Thanx
Max

Max Krasnyanskiy

Jan 28, 2008, 1:48:40 PM
to Steven Rostedt, Paul Jackson, Peter Zijlstra, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com
Steven Rostedt wrote:
> On Mon, Jan 28, 2008 at 08:59:10AM -0600, Paul Jackson wrote:
>> Thanks for the CC, Peter.
>
> Thanks from me too.
>
>> Max wrote:
>>> We've had scheduler support for CPU isolation ever since O(1) scheduler went it.
>>> I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
>> I recently added the per-cpuset flag 'sched_load_balance' for some
>> other realtime folks, so that they can disable the kernel scheduler
>> load balancing on isolated CPUs. It essentially allows for dynamic
>> control of which CPUs are isolated by the scheduler, using the cpuset
>> hierarchy, rather than enhancing the 'isolated_cpus' mask. That
>> 'isolated_cpus' mask remained a minimal kernel boottime parameter.
>> I believe this went to Linus's tree about Oct 2007.
>>
>> It looks like you have three additional tweaks for realtime in this
>> patch set, with your patches:
>>
>> [PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot
>
> I didn't know we still routed IRQs to isolated CPUs. I guess I need to
> look deeper into the code on this one. But I agree that isolated CPUs
> should not have IRQs routed to them.
Also note that it's just a convenience feature. In other words, it's not that with this patch
we'll never route IRQs to those CPUs. They can still be explicitly routed by writing to
/proc/irq/N/smp_affinity.
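
For example (the IRQ number is arbitrary; the mask is a hex bitmap of CPUs, so bit 1 = CPU1):

  echo 2 > /proc/irq/24/smp_affinity   # explicitly allow IRQ 24 on the isolated CPU1
  cat /proc/irq/24/smp_affinity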

>> [PATCH] [CPUISOL] Support for workqueue isolation
>
> The thing about workqueues is that they should only be woken on a CPU if
> something on that CPU accessed them. IOW, the workqueue on a CPU handles
> work that was called by something on that CPU. Which means that
> something that high prio task did triggered a workqueue to do some work.
> But this can also be triggered by interrupts, so by keeping interrupts
> off the CPU no workqueue should be activated.

No no no. That's what I thought too ;-). The problem is that things like NFS and friends
expect _all_ their workqueue threads to report back when they do certain things like
flushing buffers and stuff. The reason I added this is because my machines were getting
stuck: CPU0 was waiting for CPU1 to run the NFS workqueue threads even though no IRQs
or other things were running on it.



>> [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"
>
> This I find very dangerous. We are making an assumption that tasks on an
> isolated CPU wont be doing things that stopmachine requires. What stops
> a task on an isolated CPU from calling something into the kernel that
> stop_machine requires to halt?

I agree in general. The thing is, though, that stop machine just kills any kind of latency
guarantees. Without the patch the machine just hangs waiting for the stop-machine to run
when a module is inserted/removed. And running without dynamic module loading is not very
practical on general purpose machines. So I'd rather have an option with a big red warning
than no option at all :).

Thanx
Max

Max Krasnyanskiy

Jan 28, 2008, 1:56:38 PM
to Peter Zijlstra, Steven Rostedt, Paul Jackson, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com
Peter, I think you missed the point of this patch. It's just a convenience feature.
It simply excludes isolated CPUs from the IRQ smp_affinity masks. That's all. What did you
mean by "flat out denying all IRQs to these cpus"? IRQs can still be routed to them
by writing to /proc/irq/N/smp_affinity.

Also, this happens naturally when we bring a CPU off-line and then bring it back online,
i.e. when a CPU comes back online it's excluded from the IRQ smp_affinity masks even without
my patch.



>>> [PATCH] [CPUISOL] Support for workqueue isolation
>> The thing about workqueues is that they should only be woken on a CPU if
>> something on that CPU accessed them. IOW, the workqueue on a CPU handles
>> work that was called by something on that CPU. Which means that
>> something that high prio task did triggered a workqueue to do some work.
>> But this can also be triggered by interrupts, so by keeping interrupts
>> off the CPU no workqueue should be activated.
>
> Quite so, if nobody uses it, there is no harm in having them around. If
> they are used, its by someone already allowed on the cpu.

No no no. I just replied to Steven about that. The problem is that things like NFS and
friends expect _all_ their workqueue threads to report back when they do certain things
like flushing buffers and stuff. The reason I added this is because my machines were
getting stuck: CPU0 was waiting for CPU1 to run the NFS workqueue threads even though
no IRQs, softirqs or other things were running on it.



>>> [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"
>> This I find very dangerous. We are making an assumption that tasks on an
>> isolated CPU wont be doing things that stopmachine requires. What stops
>> a task on an isolated CPU from calling something into the kernel that
>> stop_machine requires to halt?
>
> Very dangerous indeed!

Please see my reply to Steven. I agree it's somewhat dangerous. What we could do is make it
configurable with a big fat warning. In other words, I'd rather have an option than a blanket
"do not use dynamic module loading" rule on those systems.

Max

Steven Rostedt

Jan 28, 2008, 2:01:51 PM
to Max Krasnyanskiy, Paul Jackson, Peter Zijlstra, LKML, Ingo Molnar, Gregory Haskins

On Mon, 28 Jan 2008, Max Krasnyanskiy wrote:
> >> [PATCH] [CPUISOL] Support for workqueue isolation
> >
> > The thing about workqueues is that they should only be woken on a CPU if
> > something on that CPU accessed them. IOW, the workqueue on a CPU handles
> > work that was called by something on that CPU. Which means that
> > something that high prio task did triggered a workqueue to do some work.
> > But this can also be triggered by interrupts, so by keeping interrupts
> > off the CPU no workqueue should be activated.

> No no no. That's what I though too ;-). The problem is that things like NFS and friends
> expect _all_ their workqueue threads to report back when they do certain things like
> flushing buffers and stuff. The reason I added this is because my machines were getting
> stuck because CPU0 was waiting for CPU1 to run NFS work queue threads even though no IRQs
> or other things are running on it.

This sounds more like we should fix NFS than add this for all workqueues.
Again, we want workqueues to run on behalf of whatever is running on
that CPU, including those tasks that are running on an isolcpu.


>
> >> [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"
> >
> > This I find very dangerous. We are making an assumption that tasks on an
> > isolated CPU wont be doing things that stopmachine requires. What stops
> > a task on an isolated CPU from calling something into the kernel that
> > stop_machine requires to halt?

> I agree in general. The thing is though that stop machine just kills any kind of latency
> guaranties. Without the patch the machine just hangs waiting for the stop-machine to run
> when module is inserted/removed. And running without dynamic module loading is not very
> practical on general purpose machines. So I'd rather have an option with a big red warning
> than no option at all :).

Well, that's something one of the greater powers (Linus, Andrew, Ingo)
must decide. ;-)


-- Steve

Paul Jackson

Jan 28, 2008, 2:07:08 PM
to Max Krasnyanskiy, a.p.zi...@chello.nl, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com
Max wrote:
> So far it seems that extending cpu_isolated_map
> is more natural way of propagating this notion to the rest of the kernel.
> Since it's very similar to the cpu_online_map concept and it's easy to integrated
> with the code that already uses it.

If it were just realtime support, then I suspect I'd agree that
extending cpu_isolated_map makes more sense.

But some people use realtime on systems that are also heavily
managed using cpusets. The two have to work together. I have
customers with systems running realtime on a few CPUs, at the
same time that they have a large batch scheduler (which is layered
on top of cpusets) managing jobs on a few hundred other CPUs.
Hence with the cpuset 'sched_load_balance' flag I think I've already
done what I think is one part of what your patches achieve by extending
the cpu_isolated_map.

This is a common situation with "resource management" mechanisms such
as cpusets (and more recently cgroups and the subsystem modules it
supports.) They cut across existing core kernel code that manages such
key resources as CPUs and memory. As best we can, they have to work
with each other.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@sgi.com> 1.940.382.4214

Peter Zijlstra

Jan 28, 2008, 3:23:27 PM
to Steven Rostedt, Max Krasnyanskiy, Paul Jackson, LKML, Ingo Molnar, Gregory Haskins

On Mon, 2008-01-28 at 14:00 -0500, Steven Rostedt wrote:
>
> On Mon, 28 Jan 2008, Max Krasnyanskiy wrote:
> > >> [PATCH] [CPUISOL] Support for workqueue isolation
> > >
> > > The thing about workqueues is that they should only be woken on a CPU if
> > > something on that CPU accessed them. IOW, the workqueue on a CPU handles
> > > work that was called by something on that CPU. Which means that
> > > something that high prio task did triggered a workqueue to do some work.
> > > But this can also be triggered by interrupts, so by keeping interrupts
> > > off the CPU no workqueue should be activated.
>
> > No no no. That's what I though too ;-). The problem is that things like NFS and friends
> > expect _all_ their workqueue threads to report back when they do certain things like
> > flushing buffers and stuff. The reason I added this is because my machines were getting
> > stuck because CPU0 was waiting for CPU1 to run NFS work queue threads even though no IRQs
> > or other things are running on it.
>
> This sounds more like we should fix NFS than add this for all workqueues.
> Again, we want workqueues to run on the behalf of whatever is running on
> that CPU, including those tasks that are running on an isolcpu.

Agreed; by looking at my top output (and not the nfs code) it looks like
it just spawns a configurable number of active kernel threads which are
not cpu bound in any way. I think just removing the isolated cpus
from their runnable mask should take care of them.
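
From user space, trimming their runnable mask would amount to something like this (illustrative
only; it assumes the kernel threads in question accept affinity changes):

  # keep the nfsd kernel threads off the isolated CPU1 by restricting them to CPU0 (mask 0x1)
  for pid in $(pgrep '^nfsd$'); do taskset -p 0x1 "$pid"; done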

>
> >
> > >> [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"
> > >
> > > This I find very dangerous. We are making an assumption that tasks on an
> > > isolated CPU wont be doing things that stopmachine requires. What stops
> > > a task on an isolated CPU from calling something into the kernel that
> > > stop_machine requires to halt?
>
> > I agree in general. The thing is though that stop machine just kills any kind of latency
> > guaranties. Without the patch the machine just hangs waiting for the stop-machine to run
> > when module is inserted/removed. And running without dynamic module loading is not very
> > practical on general purpose machines. So I'd rather have an option with a big red warning
> > than no option at all :).
>
> Well, that's something one of the greater powers (Linus, Andrew, Ingo)
> must decide. ;-)

I'm in favour of a better engineered method; that is, we really should try
to solve these problems in a proper way. Hacks like this might be fine
for custom kernels, but I think we should have a higher standard when it
comes to upstream - we all have to live many years with whatever we put
in there, so we'd better think well about it.

Max Krasnyanskiy

Jan 28, 2008, 4:43:42 PM
to Peter Zijlstra, Steven Rostedt, Paul Jackson, LKML, Ingo Molnar, Gregory Haskins
Peter Zijlstra wrote:
> On Mon, 2008-01-28 at 14:00 -0500, Steven Rostedt wrote:
>> On Mon, 28 Jan 2008, Max Krasnyanskiy wrote:
>>>>> [PATCH] [CPUISOL] Support for workqueue isolation
>>>> The thing about workqueues is that they should only be woken on a CPU if
>>>> something on that CPU accessed them. IOW, the workqueue on a CPU handles
>>>> work that was called by something on that CPU. Which means that
>>>> something that high prio task did triggered a workqueue to do some work.
>>>> But this can also be triggered by interrupts, so by keeping interrupts
>>>> off the CPU no workqueue should be activated.
>>> No no no. That's what I though too ;-). The problem is that things like NFS and friends
>>> expect _all_ their workqueue threads to report back when they do certain things like
>>> flushing buffers and stuff. The reason I added this is because my machines were getting
>>> stuck because CPU0 was waiting for CPU1 to run NFS work queue threads even though no IRQs
>>> or other things are running on it.
>> This sounds more like we should fix NFS than add this for all workqueues.
>> Again, we want workqueues to run on the behalf of whatever is running on
>> that CPU, including those tasks that are running on an isolcpu.
>
> agreed, by looking at my top output (and not the nfs code) it looks like
> it just spawns a configurable number of active kernel threads which are
> not cpu bound by in any way. I think just removing the isolated cpus
> from their runnable mask should take care of them.

Actually NFS was just one example. I cannot remember off the top of my head what else was there,
but there are definitely other users of workqueues that expect all their threads to run at
some point in time.
Also, if you think about it, the patch does _exactly_ what you propose. It removes workqueue
threads from isolated CPUs. But instead of doing it just for NFS and/or other subsystems
separately, it does it in a generic way by simply not starting those threads in the first
place.



>>>>> [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"
>>>> This I find very dangerous. We are making an assumption that tasks on an
>>>> isolated CPU wont be doing things that stopmachine requires. What stops
>>>> a task on an isolated CPU from calling something into the kernel that
>>>> stop_machine requires to halt?
>>> I agree in general. The thing is though that stop machine just kills any kind of latency
>>> guaranties. Without the patch the machine just hangs waiting for the stop-machine to run
>>> when module is inserted/removed. And running without dynamic module loading is not very
>>> practical on general purpose machines. So I'd rather have an option with a big red warning
>>> than no option at all :).
>> Well, that's something one of the greater powers (Linus, Andrew, Ingo)
>> must decide. ;-)
>
> I'm in favour of better engineered method, that is, we really should try
> to solve these problems in a proper way. Hacks like this might be fine
> for custom kernels, but I think we should have a higher standard when it
> comes to upstream - we all have to live many years with whatever we put
> in there, we'd better think well about it.

100% agree. That's why I mentioned that this patch is controversial in the first place.
Right now, short of rewriting module loading to not use stop machine, there is no other
option. I'll think some more about it. If you guys have other ideas please drop me a note.

Thanx
Max

Max Krasnyanskiy

Jan 28, 2008, 4:48:58 PM
to Paul Jackson, a.p.zi...@chello.nl, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com
Paul Jackson wrote:
> Max wrote:
>> So far it seems that extending cpu_isolated_map
>> is more natural way of propagating this notion to the rest of the kernel.
>> Since it's very similar to the cpu_online_map concept and it's easy to integrated
>> with the code that already uses it.
>
> If it were just realtime support, then I suspect I'd agree that
> extending cpu_isolated_map makes more sense.
>
> But some people use realtime on systems that are also heavily
> managed using cpusets. The two have to work together. I have
> customers with systems running realtime on a few CPUs, at the
> same time that they have a large batch scheduler (which is layered
> on top of cpusets) managing jobs on a few hundred other CPUs.
> Hence with the cpuset 'sched_load_balance' flag I think I've already
> done what I think is one part of what your patches achieve by extending
> the cpu_isolated_map.
>
> This is a common situation with "resource management" mechanisms such
> as cpusets (and more recently cgroups and the subsystem modules it
> supports.) They cut across existing core kernel code that manages such
> key resources as CPUs and memory. As best we can, they have to work
> with each other.

Thanks for the info Paul. I'll definitely look into using this flag instead
and reply with pros and cons (if any).

Max

Daniel Walker

Jan 28, 2008, 6:44:06 PM
to Max Krasnyanskiy, Peter Zijlstra, linux-...@vger.kernel.org, Ingo Molnar, Steven Rostedt, Gregory Haskins, Paul Jackson

On Mon, 2008-01-28 at 10:32 -0800, Max Krasnyanskiy wrote:
> Just this patches. RT patches cannot achieve what I needed. Even RTAI/Xenomai can't do that.
> For example I have separate tasks with hard deadlines that must be enforced in 50usec kind
> of range and basically no idle time whatsoever. Just to give more background it's a wireless
> basestation with SW MAC/Scheduler. Another requirement is for the SW to know precise timing
> because SW. For example there is no way we can do predictable 1-2 usec sleeps.
> So I wrote a user-space engine that does all this, it requires full control of the CPU ie minimal
> overhead from the kernel, just IPIs for memory management and that's basically it. When my legal
> department lets me I'll do a presentation on this stuff at Linux RT conference or something.

What kind of hardware are you doing this on? Also I should note there is
HRT (high-resolution timers), which provides microsecond-level
granularity ..

Daniel

Max Krasnyanskiy

Jan 28, 2008, 7:17:33 PM
to Daniel Walker, Peter Zijlstra, linux-...@vger.kernel.org, Ingo Molnar, Steven Rostedt, Gregory Haskins, Paul Jackson
Daniel Walker wrote:
> On Mon, 2008-01-28 at 10:32 -0800, Max Krasnyanskiy wrote:
>> Just this patches. RT patches cannot achieve what I needed. Even RTAI/Xenomai can't do that.
>> For example I have separate tasks with hard deadlines that must be enforced in 50usec kind
>> of range and basically no idle time whatsoever. Just to give more background it's a wireless
>> basestation with SW MAC/Scheduler. Another requirement is for the SW to know precise timing
>> because SW. For example there is no way we can do predictable 1-2 usec sleeps.
>> So I wrote a user-space engine that does all this, it requires full control of the CPU ie minimal
>> overhead from the kernel, just IPIs for memory management and that's basically it. When my legal
>> department lets me I'll do a presentation on this stuff at Linux RT conference or something.
>
> What kind of hardware are you doing this on?
All kinds of HW. I mentioned it in the intro email.
Here are the highlights
HP XW9300 (Dual Opteron NUMA box) and XW9400 (Dual Core Opteron)
HP DL145 G2 (Dual Opteron) and G3 (Dual Core Opteron)
Dell Precision workstations (Core2 Duo and Quad)
Various Core2 Duo based systems uTCA boards
Mercury AXA110 (1.5Ghz)
Concurrent Tech AM110 (2.1Ghz)

This scheme should work on anything that lets you disable SMI on the isolated core(s).

> Also I should note there is HRT (High resolution timers) which provided microsecond level
> granularity ..

Not accurate enough and way too much overhead for what I need. I know at this point it probably
sounds like I'm talking BS :). I wish I'd released the engine and examples by now. Anyway, let
me just say that the SW MAC has crazy tight deadlines with lots of small tasks. Using nanosleep() &
gettimeofday() is simply not practical. So it's all TSC based with clever time-sync logic between
HW and SW.

Max

Daniel Walker

Jan 28, 2008, 8:35:49 PM
to Max Krasnyanskiy, Peter Zijlstra, linux-...@vger.kernel.org, Ingo Molnar, Steven Rostedt, Gregory Haskins, Paul Jackson

On Mon, 2008-01-28 at 16:12 -0800, Max Krasnyanskiy wrote:

> Not accurate enough and way too much overhead for what I need. I know at this point it probably
> sounds like I'm talking BS :). I wish I've released the engine and examples by now. Anyway let
> me just say that SW MAC has crazy tight deadlines with lots of small tasks. Using nanosleep() &
> gettimeofday() is simply not practical. So it's all TSC based with clever time sync logic between
> HW and SW.

I don't know if it's BS or not; you clearly fixed your own problem, which
is good .. Although when you say "RT patches cannot achieve what I
needed. Even RTAI/Xenomai can't do that.", and HRT is "not accurate
enough and way too much overhead" .. given the hardware you're using,
that's all difficult to believe.. You also said this code has been
running on production systems for two years, which means it's at least
two years old .. There have been some good sized leaps in real-time Linux
in the past two years ..

Daniel

Mark Hounschell

Jan 31, 2008, 8:10:55 AM
to ma...@qualcomm.com, linux-...@vger.kernel.org, Mark Hounschell

It's good to hear from someone else who thinks a multi-processor
box _should_ be able to run a CPU intensive (100%) RT app on one of the
processors without adversely affecting or being affected by the others.
I have had issues that were _traced_ back to the fact that I am doing
just that. All I got was: you can't do that, or we don't support that
kind of thing in the Linux kernel.

One example: Andrew Morton's feedback to the LKML thread "floppy.c soft
lockup".

Good luck with this. I hope this gets someone's attention.

BTW, I have tried your patches against a vanilla 2.6.24 kernel but am
not successful.

# echo '1' > /sys/devices/system/cpu/cpu1/isolated
bash: echo: write error: Device or resource busy

The cpuisol=1 cmdline option yields:

harley:# cat /sys/devices/system/cpu/cpu1/isolated
0

harley:# cat /proc/cmdline
root=/dev/sda3 vga=normal apm=off selinux=0 noresume splash=silent
kmalloc=192M cpuisol=1


Regards
Mark

Max Krasnyanskiy

Jan 31, 2008, 2:13:30 PM
to Mark Hounschell, linux-...@vger.kernel.org
Hi Mark,
Thanks for the support. I do the best I can, because just like you I believe that it's
a perfectly valid workload and there are a lot of interesting applications that will benefit
from mainline support.

> BTW, I have tried your patches against a vanilla 2.6.24 kernel but am
> not successful.
>
> # echo '1' > /sys/devices/system/cpu/cpu1/isolated
> bash: echo: write error: Device or resource busy

You have to bring it offline first.
In other words:
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/isolated
echo 1 > /sys/devices/system/cpu/cpu1/online
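
After that you would typically pin the application onto the isolated CPU explicitly, e.g.
(the application name is just a placeholder):

  taskset -c 1 ./my_rt_app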

> The cpuisol=1 cmdline option yields:
>
> harley:# cat /sys/devices/system/cpu/cpu1/isolated
> 0
>
> harley:# cat /proc/cmdline
> root=/dev/sda3 vga=normal apm=off selinux=0 noresume splash=silent
> kmalloc=192M cpuisol=1

Sorry, my bad. I had a typo in the patch description; the option is "isolcpus=N".
We've had that option for a while now. I mean it's not even part of my patch.

Thanx
Max

Max Krasnyanskiy

Jan 31, 2008, 2:14:05 PM
to Paul Jackson, a.p.zi...@chello.nl, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com
Paul Jackson wrote:
> Max wrote:
>> So far it seems that extending cpu_isolated_map
>> is more natural way of propagating this notion to the rest of the kernel.
>> Since it's very similar to the cpu_online_map concept and it's easy to integrated
>> with the code that already uses it.
>
> If it were just realtime support, then I suspect I'd agree that
> extending cpu_isolated_map makes more sense.
>
> But some people use realtime on systems that are also heavily
> managed using cpusets. The two have to work together. I have
> customers with systems running realtime on a few CPUs, at the
> same time that they have a large batch scheduler (which is layered
> on top of cpusets) managing jobs on a few hundred other CPUs.
> Hence with the cpuset 'sched_load_balance' flag I think I've already
> done what I think is one part of what your patches achieve by extending
> the cpu_isolated_map.
>
> This is a common situation with "resource management" mechanisms such
> as cpusets (and more recently cgroups and the subsystem modules it
> supports.) They cut across existing core kernel code that manages such
> key resources as CPUs and memory. As best we can, they have to work
> with each other.

Hi Paul,

I thought some more about your proposal to use the sched_load_balance flag in cpusets instead
of extending cpu_isolated_map. I looked at cpusets, cgroups, and the latest thread started by
Peter (about sched domains and stuff), and here are my thoughts on this.

Here is the list of issues with the sched_load_balance flag from a CPU isolation perspective:
--
(1) Boot-time isolation is not possible. There is currently no way to set up a cpuset at
boot time. For example we won't be able to isolate cpus from irqs and workqueues at boot.
Not a major issue, but still an inconvenience.

--
(2) There is currently no easy way to figure out which cpuset a cpu belongs to in order
to query its sched_load_balance flag. In order to do that we need a method that iterates over
all active cpusets and checks their cpus_allowed masks. This implies holding the cgroup and
cpuset mutexes. It's not clear whether it's ok to do that from the contexts CPU
isolation happens in (apic, sched, workqueue). It seems that the cgroup/cpuset API is designed
for top-down access, i.e. adding a cpu to a set and then recomputing domains. That makes
perfect sense for the common cpuset use case but is not what cpu isolation needs.
In other words, I think it's much simpler and cleaner to use the cpu_isolated_map for isolation
purposes.

--
(3) cpusets are a bit too dynamic :). What I mean by this is that the sched_load_balance flag
can be changed at any time without bringing a CPU offline. That means we'd need some notifier
mechanism for killing and restarting workqueue threads when that flag changes. We'd also need
some logic that makes sure a user does not disable load balancing on all cpus, because that
would effectively kill workqueues on all the cpus.
This particular case is already handled very nicely in my patches. The isolated bit can be set
only when a cpu is offline, and it cannot be set on the first online cpu. Workqueues and other
subsystems already handle cpu hotplug events nicely and can easily ignore isolated cpus when
they come online.

-----

#1 is probably unfixable. #2 and #3 can be fixed but at the expense of extra complexity across
the board. I seriously doubt that I'll be able to push that through the reviews ;-).

Also, personally I still think cpusets and cpu isolation attack two different problems. cpusets
is about partitioning cpus and memory nodes, and managing tasks. Most of the cgroups/cpuset APIs
are designed to deal with tasks. CPU isolation is much simpler and sits at a lower layer. It deals
with IRQs, kernel per-cpu threads, etc. The only intersection I see is that both features affect
scheduling domains (cpu isolation is again simple here: it just puts cpus into null domains, and
that's existing logic in sched.c, nothing new).
So here are some proposals on how we can make them play nicely with each other.

--
(A) Make cpusets aware of isolated cpus.
All we have to do here is to change
guarantee_online_cpus()
common_cpu_mem_hotplug_unplug()
to exclude cpu_isolated_map from cpu_online_map before using it.
And we'd need to change
update_cpumasks()
to simply ignore isolated cpus.

That way, if a cpu is isolated it'll be ignored by the cpusets logic, which I believe is the
correct behavior.
We're talking about a trivial ~5-line patch which will be a noop if cpu isolation is disabled.

(B) Ignore the isolated map in cpusets. That's the current state of affairs with my patches applied.
Looks like your customers are happy with what they have now, so they will probably not enable
cpu isolation anyway :).

(C) Introduce cpu_usable_map. That map will be recomputed on hotplug events. Essentially it'd be
cpu_online_map AND ~cpu_isolated_map. Convert things like cpusets to use that map instead of
online map.

We can probably come up with other options. My preference would be option (A).
I can cook up a patch for this and re-send the patch series.
What do you think?

BTW, my impression is that we're talking about very different use cases here. You're talking about
big machines with lots of cpus, and I'm thinking you're probably talking about soft RT here, probably
RT networking services or something like that.

The use case I'm talking about is a dedicated machine for a certain task, like an HW simulator, a
wireless base station with SW MAC, etc. For this, in any foreseeable future the most common configuration
will be 2-8 cores. cpusets is probably overkill here because apps will want to manage thread affinities
themselves anyway (for example right now we bind soft-RT threads to CPU0 and the hard-RT thread to CPU1).

Sorry for the typos :)
Max

Paul Jackson

Feb 2, 2008, 1:20:58 AM
to Max Krasnyanskiy, a.p.zi...@chello.nl, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com
Max wrote:
> Here is the list of things of issues with sched_load_balance flag from CPU isolation
> perspective:

A separate thread happened to start up on lkml.org, shortly after
yours, that went into this in considerable detail.

For example, the interaction of cpusets, sched_load_balance,
sched_domains and real time scheduling is examined in some detail on
this thread. Everyone participating on that thread learned something
(we all came into it with less than a full picture of what's there.)

I would encourage you to read it closely. For example, the scheduler
code should not be trying to access per-cpuset attributes such as
the sched_load_balance flag (you are correct that this would be
difficult to do because of the locking; however by design, that is
not to be done.)

This thread begins at:

scheduler scalability - cgroups, cpusets and load-balancing
http://lkml.org/lkml/2008/1/29/60

Too bad we didn't think to include you in the CC list of that
thread from the beginning.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@sgi.com> 1.940.382.4214

Max Krasnyansky

Feb 3, 2008, 12:59:46 AM
to Paul Jackson, a.p.zi...@chello.nl, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com
Paul Jackson wrote:
> Max wrote:
>> Here is the list of things of issues with sched_load_balance flag from CPU isolation
>> perspective:
>
> A separate thread happened to start up on lkml.org, shortly after
> yours, that went into this in considerable detail.
>
> For example, the interaction of cpusets, sched_load_balance,
> sched_domains and real time scheduling is examined in some detail on
> this thread. Everyone participating on that thread learned something
> (we all came into it with less than a full picture of what's there.)
>
> I would encourage you to read it closely. For example, the scheduler
> code should not be trying to access per-cpuset attributes such as
> the sched_load_balance flag (you are correct that this would be
> difficult to do because of the locking; however by design, that is
> not to be done.)
>
> This thread begins at:
>
> scheduler scalability - cgroups, cpusets and load-balancing
> http://lkml.org/lkml/2008/1/29/60
>
> Too bad we didn't think to include you in the CC list of that
> thread from the beginning.

Paul, I actually mentioned at the beginning of my email that I did read that thread
started by Peter. I did learn quite a bit from it :)
You guys did not discuss isolation stuff though. The thread was only about scheduling,
and my cpu isolation extension patches deal with other aspects.

Sounds like at this point we're in agreement that sched_load_balance is not suitable
for what I'd like to achieve. But how about making cpusets aware of the cpu_isolated_map?
Even without my patches it's somewhat of an issue right now. I mean if you use the isolcpus=
boot option to put cpus into the null domain, cpusets will not be aware of it. The result may
be a bit confusing if an isolated cpu is added to some cpuset.

Max

Paul Jackson

Feb 3, 2008, 2:53:49 AM
to Max Krasnyansky, a.p.zi...@chello.nl, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com
Max wrote:
> Paul, I actually mentioned at the beginning of my email that I did read that thread
> started by Peter. I did learn quite a bit from it :)

Ah - sorry - I missed that part. However, I'm still getting the feeling
that there were some key points in that thread that we have not managed
to communicate successfully.

> Sounds like at this point we're in agreement that sched_load_balance is not suitable
> for what I'd like to achieve.

I don't think we're in agreement; I think we're in confusion ;)

Yes, sched_load_balance does not *directly* have anything to do with
this.

But indirectly it is a critical element in what I think you'd like to
achieve. It affects how the cpuset code sets up sched_domains, and
if I understand correctly, you require either (1) some sched_domains to
only contain RT tasks, or (2) some CPUs to be in no sched_domain at all.

Proper configuration of the cpuset hierarchy, including the setting of
the per-cpuset sched_load_balance flag, can provide either of these
sched_domain partitions, as desired.

> But how about making cpusets aware of the cpu_isolated_map ?

No. That's confusing cpusets and the scheduler again.

The cpu_isolated_map is a file static variable known only within
the kernel/sched.c file; this should not change.

Presently, the boot parameter isolcpus= is just used to initialize
what CPUs are isolated at boot, and then the sched_domain partitioning,
as done in kernel/sched.c:partition_sched_domains() (the hook into
the sched code that cpusets uses) determines which CPUs are isolated
from that point forward. I doubt that this should change either.

In that thread referenced above, did you see the part where RT is
achieved not by isolating CPUs from any scheduler, but rather by
polymorphically having several schedulers available to operate on each
sched_domain, and having RT threads self-select the RT scheduler?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@sgi.com> 1.940.382.4214

Max Krasnyansky

Feb 4, 2008, 1:05:20 AM
to Paul Jackson, a.p.zi...@chello.nl, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com
Paul Jackson wrote:
> Max wrote:
>> Paul, I actually mentioned at the beginning of my email that I did read that thread
>> started by Peter. I did learn quite a bit from it :)
>
> Ah - sorry - I missed that part. However, I'm still getting the feeling
> that there were some key points in that thread that we have not managed
> to communicate successfully.
I think you are assuming that I only need to deal with RT scheduler and scheduler
domains which is not correct. See below.

>> Sounds like at this point we're in agreement that sched_load_balance is not suitable
>> for what I'd like to achieve.
>
> I don't think we're in agreement; I think we're in confusion ;)

Yeah. I don't believe I'm the confused side though ;-)

> Yes, sched_load_balance does not *directly* have anything to do with this.
>
> But indirectly it is a critical element in what I think you'd like to
> achieve. It affects how the cpuset code sets up sched_domains, and
> if I understand correctly, you require either (1) some sched_domains to
> only contain RT tasks, or (2) some CPUs to be in no sched_domain at all.
>
> Proper configuration of the cpuset hierarchy, including the setting of
> the per-cpuset sched_load_balance flag, can provide either of these
> sched_domain partitions, as desired.

Again you're assuming that scheduling domain partitioning satisfies my requirements
or addresses my use case. It does not. See below for more details.



>> But how about making cpusets aware of the cpu_isolated_map ?
>
> No. That's confusing cpusets and the scheduler again.
>
> The cpu_isolated_map is a file static variable known only within
> the kernel/sched.c file; this should not change.

I completely disagree. In fact I think all the cpu_xxx_map (online, present, isolated)
variables do not belong in the scheduler code. I'm thinking of submitting a patch that
factors them out into kernel/cpumask.c; we already have cpumask.h.

> Presently, the boot parameter isolcpus= is just used to initialize
> what CPUs are isolated at boot, and then the sched_domain partitioning,
> as done in kernel/sched.c:partition_sched_domains() (the hook into
> the sched code that cpusets uses) determines which CPUs are isolated
> from that point forward. I doubt that this should change either.

Sure, I did not even touch that part. I just proposed to extend the meaning of the
'isolated' bit.

> In that thread referenced above, did you see the part where RT is
> achieved not by isolating CPUs from any scheduler, but rather by
> polymorphically having several schedulers available to operate on each
> sched_domain, and having RT threads self-select the RT scheduler?

Absolutely, yes. I saw that part. But it has nothing to do with my use case.

Looks like I failed to explain what I'm trying to achieve, so let me try again.
I'd like to be able to run a CPU intensive (100%) RT task on one of the processors without
adversely affecting or being affected by the other system activities. System activities
here include _kernel_ activities as well. Hence the proposal is to extend the current CPU
isolation feature.

The new definition of the CPU isolation would be:
---
1. Isolated CPU(s) must not be subject to scheduler load balancing
Users must explicitly bind threads in order to run on those CPU(s).

2. By default interrupts must not be routed to the isolated CPU(s)
User must route interrupts (if any) explicitly.

3. In general kernel subsystems must avoid activity on the isolated CPU(s) as much as possible
Includes workqueues, per CPU threads, etc.
This feature is configurable and is disabled by default.
---

#1 affects scheduler and scheduler domains. It's already supported either by using isolcpus= boot
option or by setting "sched_load_balance" in cpusets. I'm totally happy with the current behavior
and my original patch did not mess with this functionality in any way.

#2 and #3 have _nothing_ to do with the scheduler or scheduler domains. I've been trying to explain
that for a few days now ;-). When you saw my patches for #2 and #3 you told me that you'd be interested
to see them implemented on top of the "sched_load_balance" flag. Here is your original reply
http://marc.info/?l=linux-kernel&m=120153260217699&w=2

So I looked into that and provided an explanation why it would not work or would work but would add
lots of complexity (access to internal cpuset structures, locking, etc).
My email on that is here:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2

Now, I felt from the beginning that cpusets is not the right mechanism to address #2 and #3.
The best mechanism IMO is to simply provide access to the cpu_isolated_map to the rest of the kernel.
Again, the fact that cpu_isolated_map currently lives in the scheduler code does not change anything
here, because as I explained I'm proposing to extend the meaning of "CPU isolation". I provided
dynamic access to the "isolated" bit only for convenience; it does _not_ change the existing
scheduler/sched domain/cpuset logic in any way.

Hopefully we're on the same page with regard to "CPU isolation" now.
If not, please let me know what I missed from the earlier discussions or other scheduler-related threads.

---

If you think that making cpusets aware of isolated cpus is not the right thing to do that's perfectly
fine by me. I think it'd be better if they were but we can keep things the way they are right now.

Max

Max Krasnyansky

Feb 4, 2008, 1:55:43 AM
to Daniel Walker, Peter Zijlstra, linux-...@vger.kernel.org, Ingo Molnar, Steven Rostedt, Gregory Haskins, Paul Jackson
Hi Daniel,

Sorry for not replying right away.

Daniel Walker wrote:
> On Mon, 2008-01-28 at 16:12 -0800, Max Krasnyanskiy wrote:
>
>> Not accurate enough and way too much overhead for what I need. I know at this point it probably
>> sounds like I'm talking BS :). I wish I've released the engine and examples by now. Anyway let
>> me just say that SW MAC has crazy tight deadlines with lots of small tasks. Using nanosleep() &
>> gettimeofday() is simply not practical. So it's all TSC based with clever time sync logic between
>> HW and SW.
>
> I don't know if it's BS or not, you clearly fixed your own problem which
> is good .. Although when you say "RT patches cannot achieve what I
> needed. Even RTAI/Xenomai can't do that." , and HRT is "Not accurate
> enough and way too much overhead" .. Given the hardware your using,
> that's all difficult to believe.. You also said this code has been
> running on production systems for two year, which means it's at least
> two years old .. There's been some good sized leaps in real time linux
> in the past two years ..

I've actually been tracking the RT patches fairly closely. I can't say I've tried all of them, but I do
try them from time to time. I just got the latest 2.6.24-rt1 running on an HP xw9300. Looks like it does
not handle CPU hotplug very well; I managed to kill it by bringing cpu 1 off-line. So I cannot run any
tests right now, but I will run some tomorrow.

For now let me mention that I have a simple test that sleeps for a millisecond and then does some bitbanging
for 200 usec. It measures the jitter caused by the periodic scheduler tick, IPIs and other kernel activities.
With high-res timers disabled, on most of the machines I mentioned before it shows around 1-1.2 usec worst case.
With high-res timers enabled it shows 5-6 usec. This is with 2.6.24 running on an isolated CPU. Forget about
using a user-space timer (nanosleep(), etc); even the scheduler tick itself is fairly heavy.
A gettimeofday() call on that machine takes on average 2-3 usec (not a vsyscall), and SW MAC is all about
precise timing. That's why I said that it's not practical to use that stuff for me. I do not see anything
in the -rt kernel that would improve this.
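
For a rough off-the-shelf comparison (not a replacement for the TSC-based measurements described
above), something like cyclictest from the rt-tests suite can be pinned to the isolated CPU:

  # SCHED_FIFO prio 99, clock_nanosleep based, 1 ms interval, 100k loops, bound to CPU1
  cyclictest -a 1 -p 99 -n -i 1000 -l 100000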

This is btw not to say that -rt kernel is not useful for my app in general. We have a bunch of soft-RT threads
that talk to the MAC thread. Those would definitely benefit. I think cpu isolation + -rt would work beautifully
for wireless basestations.

Max

Paul Jackson

unread,
Feb 4, 2008, 5:55:05 AM2/4/08
to Max Krasnyansky, a.p.zi...@chello.nl, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com
Max wrote:
> Looks like I failed to explain what I'm trying to achieve. So let me try again.

Well done. I read through that, expecting to disagree or at least
to not understand at some point, and got all the way through nodding
my head in agreement. Good.

Whether the earlier confusions were lack of clarity in the presentation,
or lack of competence in my brain ... well guess I don't want to ask that
question ;).

Well ... just one minor point:

Max wrote in reply to pj:


> > The cpu_isolated_map is a file static variable known only within
> > the kernel/sched.c file; this should not change.
> I completely disagree. In fact I think all the cpu_xxx_map (online, present, isolated)
> variables do not belong in the scheduler code. I'm thinking of submitting a patch that
> factors them out into kernel/cpumask.c We already have cpumask.h.

Huh? Why would you want to do that?

For one thing, the map being discussed here, cpu_isolated_map,
is only used in sched.c, so why publish it wider?

And for another thing, we already declare externs in cpumask.h for
the other, more widely used, cpu_*_map variables cpu_possible_map,
cpu_online_map, and cpu_present_map.

Other than that detail, we seem to be communicating and in agreement on
your first item, isolating CPU scheduler load balancing. Good.

On your other two items, irq and workqueue isolation, which I had
suggested doing via cpuset sched_load_balance, I now agree that that
wasn't a good idea.

I am still a little surprised at using isolation extensions to
disable irqs on select CPUs; but others have thought far more about
irqs than I have, so I'll be quiet.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@sgi.com> 1.940.382.4214

Max Krasnyanskiy

unread,
Feb 4, 2008, 6:20:59 PM2/4/08
to Paul Jackson, a.p.zi...@chello.nl, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com
Paul Jackson wrote:
> Max wrote:
>> Looks like I failed to explain what I'm trying to achieve. So let me try again.
>
> Well done. I read through that, expecting to disagree or at least
> to not understand at some point, and got all the way through nodding
> my head in agreement. Good.
>
> Whether the earlier confusions were lack of clarity in the presentation,
> or lack of competence in my brain ... well guess I don't want to ask that
> question ;).
:)

> Well ... just one minor point:
>
> Max wrote in reply to pj:
>>> The cpu_isolated_map is a file static variable known only within
>>> the kernel/sched.c file; this should not change.
>> I completely disagree. In fact I think all the cpu_xxx_map (online, present, isolated)
>> variables do not belong in the scheduler code. I'm thinking of submitting a patch that
>> factors them out into kernel/cpumask.c We already have cpumask.h.
>
> Huh? Why would you want to do that?
>
> For one thing, the map being discussed here, cpu_isolated_map,
> is only used in sched.c, so why publish it wider?
>
> And for another thing, we already declare externs in cpumask.h for
> the other, more widely used, cpu_*_map variables cpu_possible_map,
> cpu_online_map, and cpu_present_map.

Well, to address #2 and #3 isolated map will need to be exported as well.
Those other maps do not really have much to do with the scheduler code.
That's why I think either kernel/cpumask.c or kernel/cpu.c is a better place for them.

> Other than that detail, we seem to be communicating and in agreement on
> your first item, isolating CPU scheduler load balancing. Good.
>
> On your other two items, irq and workqueue isolation, which I had
> suggested doing via cpuset sched_load_balance, I now agree that that
> wasn't a good idea.
>
> I am still a little surprised at using isolation extensions to
> disable irqs on select CPUs; but others have thought far more about
> irqs than I have, so I'll be quiet.

Please note that we're not talking about completely disabling IRQs. We're talking about
not routing them to the isolated CPUs by default. It's still possible to explicitly reroute an IRQ
to an isolated CPU.
Why is this needed? It is actually very easy to explain. IRQs are the major source of latency
and overhead. IRQ handlers themselves are mostly ok, but they typically schedule softirqs, work
queues and timers on the same CPU where the IRQ is handled. In other words, if an isolated CPU is
receiving IRQs it's not really isolated, because it's running a whole bunch of different kernel
code (ie we're talking latencies, cache usage, etc).
Of course some folks may want to explicitly route certain IRQs to the isolated CPUs. For example,
if an app depends on the network stack it may make sense to route the IRQ from the NIC to the same
CPU the app is running on.
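
Rerouting an IRQ to an isolated CPU is just a write to /proc/irq/<n>/smp_affinity. A minimal
sketch (the IRQ number and CPU below are made-up values for illustration):

#include <stdio.h>

int main(void)
{
	/* Route IRQ 24 (say, the NIC) to isolated CPU 2 only. */
	FILE *f = fopen("/proc/irq/24/smp_affinity", "w");

	if (!f) {
		perror("smp_affinity");
		return 1;
	}
	fprintf(f, "%x\n", 1 << 2);	/* bitmask with only bit 2 set */
	fclose(f);
	return 0;
}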

Max

Max Krasnyanskiy

unread,
Feb 4, 2008, 7:33:45 PM2/4/08
to Peter Zijlstra, Steven Rostedt, Paul Jackson, LKML, Ingo Molnar, Gregory Haskins

Peter Zijlstra wrote:
> On Mon, 2008-01-28 at 14:00 -0500, Steven Rostedt wrote:
>> On Mon, 28 Jan 2008, Max Krasnyanskiy wrote:
>>>>> [PATCH] [CPUISOL] Support for workqueue isolation
>>>> The thing about workqueues is that they should only be woken on a CPU if
>>>> something on that CPU accessed them. IOW, the workqueue on a CPU handles
>>>> work that was called by something on that CPU. Which means that
>>>> something that high prio task did triggered a workqueue to do some work.
>>>> But this can also be triggered by interrupts, so by keeping interrupts
>>>> off the CPU no workqueue should be activated.
>>> No no no. That's what I though too ;-). The problem is that things like NFS and friends
>>> expect _all_ their workqueue threads to report back when they do certain things like
>>> flushing buffers and stuff. The reason I added this is because my machines were getting
>>> stuck because CPU0 was waiting for CPU1 to run NFS work queue threads even though no IRQs
>>> or other things are running on it.
>> This sounds more like we should fix NFS than add this for all workqueues.
>> Again, we want workqueues to run on the behalf of whatever is running on
>> that CPU, including those tasks that are running on an isolcpu.
>
> agreed, by looking at my top output (and not the nfs code) it looks like
> it just spawns a configurable number of active kernel threads which are
> not cpu bound by in any way. I think just removing the isolated cpus
> from their runnable mask should take care of them.

Peter, Steven,

I think I convinced you guys last time but I did not have a convincing example. So here is some
more info on why workqueues need to be aware of isolated cpus.

Here is how a work queue gets flushed.

static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
{
	int active;

	if (cwq->thread == current) {
		/*
		 * Probably keventd trying to flush its own queue. So simply run
		 * it by hand rather than deadlocking.
		 */
		run_workqueue(cwq);
		active = 1;
	} else {
		struct wq_barrier barr;

		active = 0;
		spin_lock_irq(&cwq->lock);
		if (!list_empty(&cwq->worklist) || cwq->current_work != NULL) {
			insert_wq_barrier(cwq, &barr, 1);
			active = 1;
		}
		spin_unlock_irq(&cwq->lock);

		if (active)
			wait_for_completion(&barr.done);
	}

	return active;
}

void fastcall flush_workqueue(struct workqueue_struct *wq)
{
	const cpumask_t *cpu_map = wq_cpu_map(wq);
	int cpu;

	might_sleep();
	lock_acquire(&wq->lockdep_map, 0, 0, 0, 2, _THIS_IP_);
	lock_release(&wq->lockdep_map, 1, _THIS_IP_);
	for_each_cpu_mask(cpu, *cpu_map)
		flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
}

In other words it schedules some work on each cpu and expects the workqueue thread to run and
trigger the completion. This is what I meant when I said that _all_ threads are expected to report
back even if there is nothing running on that CPU.

So my patch simply makes sure that isolated CPUs are ignored (if workqueue isolation is enabled),
ie that workqueue threads are not started on the CPUs that are isolated.
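
Conceptually the change boils down to something like this (a rough sketch of the flush loop above,
not the exact patch; cpu_unusable() here stands for the cpu_isolated() check made when workqueue
isolation is enabled):

	/* Skip isolated CPUs so flush_workqueue() never waits on a
	 * workqueue thread that was never started there. */
	for_each_cpu_mask(cpu, *cpu_map) {
		if (cpu_unusable(cpu))
			continue;
		flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
	}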

Max

Paul Jackson

unread,
Feb 4, 2008, 9:47:33 PM2/4/08
to Max Krasnyanskiy, a.p.zi...@chello.nl, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com
Max K wrote:
> > And for another thing, we already declare externs in cpumask.h for
> > the other, more widely used, cpu_*_map variables cpu_possible_map,
> > cpu_online_map, and cpu_present_map.
> Well, to address #2 and #3 isolated map will need to be exported as well.
> Those other maps do not really have much to do with the scheduler code.
> That's why I think either kernel/cpumask.c or kernel/cpu.c is a better place for them.

Well, if you need it to be exported for #2 or #3, then that's ok
by me - export it.

I'm unaware of any kernel/cpumask.c. If you meant lib/cpumask.c, then
I'd prefer you not put it there, as lib/cpumask.c just contains the
implementation details of the abstract data type cpumask_t, not any of
its uses. If you mean kernel/cpuset.c, then that's not a good choice
either, as that just contains the implementation details of the cpuset
subsystem. You should usually define such things in one of the files
using it, and unless there is clearly a -better- place to move the
definition, it's usually better to just leave it where it is.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@sgi.com> 1.940.382.4214

Max Krasnyansky

unread,
Feb 4, 2008, 11:10:33 PM2/4/08
to Paul Jackson, a.p.zi...@chello.nl, linux-...@vger.kernel.org, mi...@elte.hu, sros...@redhat.com, ghas...@novell.com

Paul Jackson wrote:
> Max K wrote:
>>> And for another thing, we already declare externs in cpumask.h for
>>> the other, more widely used, cpu_*_map variables cpu_possible_map,
>>> cpu_online_map, and cpu_present_map.
>> Well, to address #2 and #3 isolated map will need to be exported as well.
>> Those other maps do not really have much to do with the scheduler code.
>> That's why I think either kernel/cpumask.c or kernel/cpu.c is a better place for them.
>
> Well, if you need it to be exported for #2 or #3, then that's ok
> by me - export it.
>
> I'm unaware of any kernel/cpumask.c. If you meant lib/cpumask.c, then
> I'd prefer you not put it there, as lib/cpumask.c just contains the
> implementation details of the abstract data type cpumask_t, not any of
> its uses. If you mean kernel/cpuset.c, then that's not a good choice
> either, as that just contains the implementation details of the cpuset
> subsystem. You should usually define such things in one of the files
> using it, and unless there is clearly a -better- place to move the
> definition, it's usually better to just leave it where it is.

I was thinking of creating a new file, kernel/cpumask.c, but it probably does not make sense
just for the masks. I'm now thinking kernel/cpu.c is the best place for it. It contains all
the cpu hotplug logic that deals with those maps, and at the very top it has stuff like:

/* Serializes the updates to cpu_online_map, cpu_present_map */
static DEFINE_MUTEX(cpu_add_remove_lock);

So it seems to make sense to keep the maps in there.

Max

Max Krasnyansky

unread,
Feb 7, 2008, 12:33:56 AM2/7/08
to torv...@linux-foundation.org, Andrew Morton, LKML
Linus, please pull CPU isolation extensions from

git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git for-linus

Diffstat:

b/arch/x86/Kconfig | 1
b/arch/x86/kernel/genapic_flat_64.c | 5 ++-
b/drivers/base/cpu.c | 48 +++++++++++++++++++++++++++++++++++
b/include/linux/cpumask.h | 3 ++
b/kernel/Kconfig.cpuisol | 15 +++++++++++
b/kernel/Makefile | 4 +-
b/kernel/cpu.c | 49 ++++++++++++++++++++++++++++++++++++
b/kernel/sched.c | 37 ---------------------------
b/kernel/stop_machine.c | 9 +++++-
b/kernel/workqueue.c | 31 ++++++++++++++++------
kernel/Kconfig.cpuisol | 26 ++++++++++++++++++-
11 files changed, 176 insertions(+), 52 deletions(-)

The patchset consists of 4 patches:
cpuisol: Make cpu isolation configurable and export isolated map
cpuisol: Do not route IRQs to the CPUs isolated at boot
cpuisol: Do not schedule workqueues on the isolated CPUs
cpuisol: Do not halt isolated CPUs with Stop Machine

The first two are very simple. They simply make "CPU isolation" a configurable feature, export
cpu_isolated_map and provide some helper functions to access it (just like cpu_online() and friends).
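
Roughly (a sketch; the exact names in the tree may differ slightly) the helpers mirror
cpu_online() and friends in cpumask.h:

extern cpumask_t cpu_isolated_map;

#ifdef CONFIG_CPUISOL
#define cpu_isolated(cpu)	cpu_isset((cpu), cpu_isolated_map)
#else
#define cpu_isolated(cpu)	0
#endif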

The last two patches add support for isolating CPUs from running workqueues and stop machine. The last patch
is kind of controversial; let me know if you think it's too ugly and I'll resend without it.
For more details see below.

----
This patch series extends CPU isolation support. Yes, most people want to virtualize
CPUs these days and I want to isolate them :).

The primary idea here is to be able to use some CPU cores as dedicated engines for running
user-space code with minimal kernel overhead/intervention, think of it as an SPE in the
Cell processor. I'd like to be able to run a CPU intensive (100%) RT task on one of the
processors without adversely affecting or being affected by the other system activities.
System activities here include _kernel_ activities as well.

I'm personally using this for hard realtime purposes. With CPU isolation it's very easy to
achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf
multi-processor/core systems (vanilla kernel plus these patches) even under extreme system load.
I'm working with legal folks on releasing the hard RT user-space framework for that.

I believe that with the current multi-core CPU trend we will see more and more applications that
explore this capability: RT gaming engines, simulators, hard RT apps, etc.

Hence the proposal is to extend the current CPU isolation feature.
The new definition of the CPU isolation would be:
---
1. Isolated CPU(s) must not be subject to scheduler load balancing.
   Users must explicitly bind threads in order to run on those CPU(s).

2. By default interrupts must not be routed to the isolated CPU(s).
   Users must route interrupts (if any) to those CPUs explicitly.

3. In general kernel subsystems must avoid activity on the isolated CPU(s) as much as possible.
   This includes workqueues, per CPU threads, etc.
   This feature is configurable and is disabled by default.
---

I've been maintaining this stuff since around 2.6.18 and it's been running in a production
environment for a couple of years now. It's been tested on all kinds of machines, from NUMA
boxes like HP xw9300/9400 to tiny uTCA boards like Mercury AXA110.
The messiest part used to be SLAB garbage collector changes. With the new SLUB all that mess
goes away (ie no changes necessary). Also CFS seems to handle CPU hotplug much better than O(1)
did (ie domains are recomputed dynamically) so that isolation can be done at any time (via sysfs).
So this seems like a good time to merge.

We've had scheduler support for CPU isolation ever since the O(1) scheduler went in. In other words
#1 is already supported. These patches do not change/affect that functionality in any way.
#2 is a trivial one-liner change to the IRQ init code.
#3 is addressed by a couple of separate patches. The main problem here is that an RT thread can prevent
kernel threads from running, and the machine gets stuck because other CPUs are waiting for those threads
to run and report back.
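
As a rough sketch of what the #2 change amounts to (assuming it lives in the flat genapic
target_cpus path, per the diffstat; the exact code may differ), the default IRQ destination
mask simply excludes isolated CPUs:

static cpumask_t flat_target_cpus(void)
{
	cpumask_t mask;

	/* Route IRQs to every online CPU that is not isolated. */
	cpus_andnot(mask, cpu_online_map, cpu_isolated_map);
	return mask;
}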

Folks involved in the scheduler/cpuset development provided a lot of feedback on the first series
of patches. I believe I managed to explain and clarify every aspect.
Paul Jackson initially suggested implementing #2 and #3 using the cpusets subsystem. Paul and I looked
at it more closely and determined that exporting cpu_isolated_map instead is a better option.

The last patch, to the stop machine, is potentially unsafe and is marked as highly experimental. Unfortunately
it's currently the only option that allows dynamic module insertion/removal for the above scenarios.
If people still feel that it's too ugly I can revert that change and keep it in a separate tree
for now.

Thanx

Andrew Morton

unread,
Feb 7, 2008, 12:58:34 AM2/7/08
to Max Krasnyansky, torv...@linux-foundation.org, LKML
On Wed, 06 Feb 2008 21:32:55 -0800 Max Krasnyansky <ma...@qualcomm.com> wrote:

> Linus, please pull CPU isolation extensions from
>
> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git for-linus

The feature as a whole seems useful, and I don't actually oppose the merge
based on what I see here. As long as you're really sure that cpusets are
inappropriate (and bear in mind that Paul has a track record of being wrong
on this :)). But I see a few glitches


- There are two separate and identical implementations of
cpu_unusable(cpu). Please do it once, in a header, preferably with C
function, not macros.

- The Kconfig help is a bit scraggly:

+config CPUISOL_STOPMACHINE
+ bool "Do not halt isolated CPUs with Stop Machine (HIGHLY EXPERIMENTAL)"
+ depends on CPUISOL && STOP_MACHINE && EXPERIMENTAL
+ help
+ If this option is enabled kernel will not halt isolated CPUs when Stop Machine

"the kernel"

text is too wide

+ is triggered.
+ Stop Machine is currently only used by the module insertion and removal logic.
+ Please note that at this point this feature is highly experimental and maybe
+ dangerous. It is not known to really brake anything but can potentially
+ introduce an instability.

s/maybe/may be/
s/brake/break/


Neither this text, nor the changelog nor the code comments tell us what the
potential instability with stopmachine *is*? Or maybe I missed it.

- Adding new sysfs files without updating Documentation/ABI/ makes Greg
cry.

- Why is cpu_isolated_map exported to modules? Just for api consistency,
it appears?


pre-existing problems:

- isolated_cpu_setup() has an on-stack array of NR_CPUS integers. This
will consume 4k of stack on ia64 (at least). We'll just squeak through
for a little while, but this needs to be fixed. Just move it into
__initdata (sketch below).

- isolated_cpu_setup() expects that the user can provide an up-to-1024
character kernel boot parameter. Is this reasonable given cpu command
line limits, and given that NR_CPUS will surely grow beyond 1024 in the
future?
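
Something along these lines for the __initdata fix (an untested sketch; the real
isolated_cpu_setup() may differ in details):

static int isolcpus_ints[NR_CPUS] __initdata;

static int __init isolated_cpu_setup(char *str)
{
	int i;

	/* Parse "isolcpus=" into the __initdata array instead of the stack. */
	get_options(str, ARRAY_SIZE(isolcpus_ints), isolcpus_ints);
	cpus_clear(cpu_isolated_map);
	for (i = 1; i <= isolcpus_ints[0]; i++)
		if (isolcpus_ints[i] < NR_CPUS)
			cpu_set(isolcpus_ints[i], cpu_isolated_map);
	return 1;
}
__setup("isolcpus=", isolated_cpu_setup);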

Paul Jackson

unread,
Feb 7, 2008, 3:00:45 AM2/7/08
to Andrew Morton, ma...@qualcomm.com, torv...@linux-foundation.org, linux-...@vger.kernel.org
Andrew wrote:
> (and bear in mind that Paul has a track record of being wrong
> on this :))

heh - I saw that <grin>.

Max - Andrew's about right, as usual. You answered my initial
questions on this patch set adequately, but hard real time is
not my expertise, so in the final analysis, other than my saying
I don't have any more objections, my input doesn't mean much
either way.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@sgi.com> 1.940.382.4214

Andrew Morton

unread,
Feb 7, 2008, 3:13:02 AM2/7/08
to Paul Jackson, ma...@qualcomm.com, torv...@linux-foundation.org, linux-...@vger.kernel.org
On Thu, 7 Feb 2008 01:59:54 -0600 Paul Jackson <p...@sgi.com> wrote:

> but hard real time is
> not my expertise

Speaking of which.. there is the -rt tree. Have those people had a look
at the feature, perhaps played with the code?

Max Krasnyansky

unread,
Feb 7, 2008, 12:23:17 PM2/7/08
to Andrew Morton, torv...@linux-foundation.org, LKML
Andrew Morton wrote:
> On Wed, 06 Feb 2008 21:32:55 -0800 Max Krasnyansky <ma...@qualcomm.com> wrote:
>
>> Linus, please pull CPU isolation extensions from
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git for-linus
>
> The feature as a whole seems useful, and I don't actually oppose the merge
> based on what I see here.
Awesome :). I think it'll get more and more useful as people start trying
to figure out what the heck they're supposed to do with the spare CPU cores.
I mean pretty soon most machines will have 4 cores and some will have 8.
One way to use those cores is the "dedicated engine" model.

> As long as you're really sure that cpusets are
> inappropriate (and bear in mind that Paul has a track record of being wrong
> on this :)).

I'll cover this in a separate email with more details.



> But I see a few glitches

Good catches. Thanks for reviewing.

> - There are two separate and identical implementations of
> cpu_unusable(cpu). Please do it once, in a header, preferably with C
> function, not macros.

Those are local versions that depend on whether a feature is enabled or not.
If CONFIG_CPUISOL_WORKQUEUE is disabled we want cpu_unusable()
in workqueue.c to be a noop, and if it's enabled that macro resolves to
cpu_isolated().
Same thing for stop_machine.c: if CONFIG_CPUISOL_STOPMACHINE is disabled
cpu_unusable() is a noop.
In other words cpu_isolated() is the one common macro that a subsystem may
want to stub out.
Do you see another way of doing this?
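
For reference, here is roughly what those local stubs amount to (a sketch; the exact form in the
patch may differ):

/* kernel/workqueue.c */
#ifdef CONFIG_CPUISOL_WORKQUEUE
#define cpu_unusable(cpu)	cpu_isolated(cpu)
#else
#define cpu_unusable(cpu)	0
#endif

/* kernel/stop_machine.c does the same thing keyed off CONFIG_CPUISOL_STOPMACHINE. */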

> - The Kconfig help is a bit scraggly:
>
> +config CPUISOL_STOPMACHINE
> + bool "Do not halt isolated CPUs with Stop Machine (HIGHLY EXPERIMENTAL)"
> + depends on CPUISOL && STOP_MACHINE && EXPERIMENTAL
> + help
> + If this option is enabled kernel will not halt isolated CPUs when Stop Machine
>
> "the kernel"
>
> text is too wide

Got it. Will fix asap.



> + is triggered.
> + Stop Machine is currently only used by the module insertion and removal logic.
> + Please note that at this point this feature is highly experimental and maybe
> + dangerous. It is not known to really brake anything but can potentially
> + introduce an instability.
>
> s/maybe/may be/
> s/brake/break/

Man, the typos are killing me :). Will fix.

> Neither this text, nor the changelog nor the code comments tell us what the
> potential instability with stopmachine *is*? Or maybe I missed it.

That's the thing, we don't really know :). In real life it does not seem to be a problem at all.
As I mentioned in previous emails, we've been running all kinds of machines with this enabled,
inserting all kinds of modules left and right, and have never seen any crashes or anything.
But the fact that stopmachine is supposed to halt all cpus during module insertion/removal
seems to imply that something bad may happen if some cpus are not halted. It may very well
turn out that it's no longer needed because our locking and refcounting handle this just fine.
I mean, ideally we should not have to halt the entire box; it causes terrible latencies.



> - Adding new sysfs files without updating Documentation/ABI/ makes Greg cry.

Oh, did not know that. Will fix.

>
> - Why is cpu_isolated_map exported to modules? Just for api consistency, it appears?

Yes. For consistency. We'd want cpu_isolated() to work everywhere.



> pre-existing problems:
>
> - isolated_cpu_setup() has an on-stack array of NR_CPUS integers. This
> will consume 4k of stack on ia64 (at least). We'll just squeak through
> for a little while, but this needs to be fixed. Just move it into
> __initdata.

Will do.



> - isolated_cpu_setup() expects that the user can provide an up-to-1024
> character kernel boot parameter. Is this reasonable given cpu command
> line limits, and given that NR_CPUS will surely grow beyond 1024 in the
> future?

I'm thinking that is reasonable for now.

I'll fix and resend the patches asap.

Thanx
Max

Max Krasnyansky

unread,
Feb 7, 2008, 12:36:48 PM2/7/08
to Linus Torvalds, Andrew Morton, LKML
Hi Linus,

Linus Torvalds wrote:


>
> On Wed, 6 Feb 2008, Max Krasnyansky wrote:
>> Linus, please pull CPU isolation extensions from
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git for-linus
>

> Have these been in -mm and widely discussed etc? I'd like to start more
> carefully, and (a) have that controversial last patch not merged initially
> and (b) make sure everybody is on the same page wrt this all..

They've been discussed with RT/scheduler/cpuset folks.
Andrew is definitely in the loop. He just replied and asked for some fixes and
clarifications. He seems to be ok with merging this in general.

The last patch may not be as bad as I originally thought. We'll discuss it some
more with Andrew. I'll also check with Rusty, who wrote the stopmachine in the
first place. It actually seems like overkill at this point. My impression is
that it was supposed to be a safety net in case some refcounting/locking is not fully
safe, and it may not be needed or as critical anymore.
I may be wrong of course. So I'll find that out :)

Max Krasnyansky

unread,
Feb 7, 2008, 1:03:56 PM2/7/08
to Paul Jackson, Andrew Morton, torv...@linux-foundation.org, linux-...@vger.kernel.org
Paul Jackson wrote:
> Andrew wrote:
>> (and bear in mind that Paul has a track record of being wrong
>> on this :))
>
> heh - I saw that <grin>.
>
> Max - Andrew's about right, as usual. You answered my initial
> questions on this patch set adequately, but hard real time is
> not my expertise, so in the final analysis, other than my saying
> I don't have any more objections, my input doesn't mean much
> either way.

I honestly think this one is a no-brainer and I do not think it will hurt Paul's track record :).
Paul initially disagreed with me, and that's when he was wrong ;-))

Andrew, I looked at this in detail and here is an explanation that
I sent to Paul a few days ago (a bit shortened/updated version).

--------


I thought some more about your proposal to use the sched_load_balance flag in cpusets instead of extending
cpu_isolated_map. I looked at cpusets and cgroups, and here are my thoughts on this.
Here is the list of issues with the sched_load_balance flag from a CPU isolation perspective:


--
(1) Boot time isolation is not possible. There is currently no way to set up a cpuset at
boot time. For example we won't be able to isolate cpus from irqs and workqueues at boot.
Not a major issue but still an inconvenience.

--
(2) There is currently no easy way to figure out which cpuset a cpu belongs to in order to query
its sched_load_balance flag. In order to do that we need a method that iterates over all active cpusets
and checks their cpus_allowed masks. This implies holding the cgroup and cpuset mutexes. It's not clear
whether it's ok to do that from the contexts CPU isolation happens in (apic, sched, workqueue).
It seems that the cgroup/cpuset api is designed for top-down access, ie adding a cpu to a set and then
recomputing domains. That makes perfect sense for the common cpuset usecase but is not what cpu
isolation needs.
In other words I think it's much simpler and cleaner to use cpu_isolated_map for isolation
purposes. No locks, no races, etc.

--
(3) cpusets are a bit too dynamic :). What I mean by this is that the sched_load_balance flag
can be changed at any time without bringing a CPU offline. What that means is that we'll
need some notifier mechanism for killing and restarting workqueue threads when that flag changes.
Also we'd need some logic that makes sure a user does not disable load balancing on all cpus,
because that would effectively kill workqueues on all the cpus.
This particular case is already handled very nicely in my patches. The isolated bit can be set
only when a cpu is offline, and it cannot be set on the first online cpu. Workqueues and other
subsystems already handle cpu hotplug events nicely and can easily ignore isolated cpus when
they come online.

--


#1 is probably unfixable. #2 and #3 can be fixed but at the expense of extra complexity across
the board. I seriously doubt that I'll be able to push that through the reviews ;-).

Also, personally I still think cpusets and cpu isolation attack two different problems. cpusets is about
partitioning cpus and memory nodes, and managing tasks. Most of the cgroups/cpuset APIs are designed to
deal with tasks.
CPU isolation is much simpler and sits at a lower layer. It deals with IRQs, kernel per cpu threads, etc.

The only intersection I see is that both features affect scheduling domains. CPU isolation is again
simple here: it uses the existing logic in sched.c and does not change anything in this area.

---------

Andrew, hopefully that clarifies it. Let me know if you're not convinced.

Max

Paul Jackson

unread,
Feb 7, 2008, 1:11:16 PM2/7/08
to Max Krasnyansky, ak...@linux-foundation.org, torv...@linux-foundation.org, linux-...@vger.kernel.org
Max - Andrew wondered if the rt tree had seen the
code or commented on it. What became of that?

My two cents isn't worth a plug nickel here, but
I'm inclined to nod in agreement when Linus wants
to see these patches get some more exposure before
going into Linus's tree. ... what's the hurry?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@sgi.com> 1.940.382.4214

Max Krasnyansky

unread,
Feb 7, 2008, 1:15:19 PM2/7/08
to Andrew Morton, Paul Jackson, torv...@linux-foundation.org, linux-...@vger.kernel.org
Andrew Morton wrote:
> On Thu, 7 Feb 2008 01:59:54 -0600 Paul Jackson <p...@sgi.com> wrote:
>
>> but hard real time is not my expertise
>
> Speaking of which.. there is the -rt tree. Have those people had a look
> at the feature, perhaps played with the code?

Peter Z. and Steven R. sent me some comments; I believe I explained and addressed them.
Ingo's been quiet. Probably too busy.

btw, it's not an RT feature per se. It certainly helps RT by removing all the latency
sources from the isolated CPUs. But in general it's just a "reduce kernel overhead on some CPUs"
kind of feature.

Max

Max Krasnyansky

unread,
Feb 7, 2008, 1:28:45 PM2/7/08
to Paul Jackson, ak...@linux-foundation.org, torv...@linux-foundation.org, linux-...@vger.kernel.org

Paul Jackson wrote:
> Max - Andrew wondered if the rt tree had seen the
> code or commented it on it. What became of that?

I just replied to Andrew. It's not an RT feature per se.
And yes, Peter CC'ed the RT folks. You probably did not get a chance to read all the replies.
They had some questions/concerns and stuff. I believe I answered/clarified all of them.

> My two cents isn't worth a plug nickel here, but
> I'm inclined to nod in agreement when Linus wants
> to see these patches get some more exposure before
> going into Linus's tree. ... what's the hurry?

No hurry I guess. I did mention in the introductory email that I've been maintaining
this stuff for a while now. SLAB patches used to be messy; with the new SLUB the mess goes away.
CFS handles CPU hotplug much better than O(1), and cpu hotplug is needed to be able to change
the isolated bit from sysfs. That's why I think it's a good time to merge.
I don't mind of course if we put this stuff in -mm first. Although the first part of the patchset
(ie exporting the isolated map, sysfs interface, etc) seems very simple and totally non-controversial.
The stop machine patch is really the only thing that may look suspicious.

Max

Andrew Morton

unread,
Feb 7, 2008, 2:27:50 PM2/7/08
to Max Krasnyansky, torv...@linux-foundation.org, LKML
On Thu, 07 Feb 2008 09:22:34 -0800 Max Krasnyansky <ma...@qualcomm.com> wrote:

> > - There are two separate and identical implementations of
> > cpu_unusable(cpu). Please do it once, in a header, preferably with C
> > function, not macros.
>
> Those are local versions that depend on whether a feature is enabled or not.
> If CONFIG_CPUISOL_WORKQUEUE is disabled we want cpu_unusable()
> in workqueue.c to be a noop, and if it's enabled that macro resolves to
> cpu_isolated().
> Same thing for stop_machine.c: if CONFIG_CPUISOL_STOPMACHINE is disabled
> cpu_unusable() is a noop.
> In other words cpu_isolated() is the one common macro that a subsystem may
> want to stub out.
> Do you see another way of doing this?

ah, I missed that. Yup, the implementation you have there looks OK.

Ingo Molnar

unread,
Feb 7, 2008, 2:52:30 PM2/7/08
to Linus Torvalds, Max Krasnyansky, Andrew Morton, LKML

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> On Wed, 6 Feb 2008, Max Krasnyansky wrote:
> >

> > Linus, please pull CPU isolation extensions from
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git
> > for-linus
>

> Have these been in -mm and widely discussed etc? I'd like to start
> more carefully, and (a) have that controversial last patch not merged
> initially and (b) make sure everybody is on the same page wrt this
> all..

no, they have not been under nearly enough testing and review - these
patches surfaced on lkml for the first time one week ago (!). I find the
pull request totally premature, this stuff has not been discussed and
agreed on _at all_. None of the people who maintain and have interest in
this code and participated in the (short) one-week discussion were
Cc:-ed to the pull request.

I think these patches also need a buy-in from Peter Zijlstra and Paul
Jackson (or really good reasoning why any objections from them should
be overridden) - all of whom deal with the code affected by these changes
on a daily basis and have an interest in CPU isolation features.

Generally i think that cpusets is actually the feature and API that
should be used (and extended) for CPU isolation - and we already
extended it recently in the direction of CPU isolation. Most enterprise
distros have cpusets enabled so it's in use. Also, cpusets has the
appeal of being commonly used in the "big honking boxes" arena, so
reusing the same concept for RT and virtualization stuff would be the
natural approach. It already ties in to the scheduler domains code
dynamically and is flexible and scalable. I resisted ad-hoc CPU
isolation patches in -rt for that reason. Also, i'd not mind some
test-coverage in sched.git as well.

Ingo

Max Krasnyansky

unread,
Feb 7, 2008, 7:41:32 PM2/7/08
to Ingo Molnar, Linus Torvalds, Andrew Morton, LKML
Hi Ingo,

Thanks for your reply.

> * Linus Torvalds <torv...@linux-foundation.org> wrote:
>
>> On Wed, 6 Feb 2008, Max Krasnyansky wrote:
>>> Linus, please pull CPU isolation extensions from
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git
>>> for-linus
>> Have these been in -mm and widely discussed etc? I'd like to start
>> more carefully, and (a) have that controversial last patch not merged
>> initially and (b) make sure everybody is on the same page wrt this
>> all..
>
> no, they have not been under nearly enough testing and review - these
> patches surfaced on lkml for the first time one week ago (!).

Almost two weeks actually. Ok 1.8 :)

> I find the pull request totally premature, this stuff has not been discussed and
> agreed on _at all_.

Ingo, I may have the wrong impression, but my impression is that you ignored all the
other emails and just read Linus' reply. I do not believe this accusation is valid.
I apologize if my impression is incorrect.
Since the patches _do not_ change/affect existing scheduler/cpuset functionality, I did
not know who to CC in the first email that I sent. Luckily Peter picked it up and CC'ed
a bunch of folks, including Paul, Steven and you.
All of them replied and had questions/concerns. As I mentioned before, I believe I addressed
all of them.



> None of the people who maintain and have interest in
> this code and participated in the (short) one-week discussion were
> Cc:-ed to the pull request.

Ok. I did not realize I was supposed to do that.
Since I got no replies to the second round of patches (take 2), which again was CC'ed to
the same people that Peter CC'ed, I assumed that people were ok with it. That's what the discussion
on the first take ended with.

> I think these patches also need a buy-in from Peter Zijlstra and Paul
> Jackson (or really good reasoning while any objections from them should
> be overriden) - all of whom deal with the code affected by these changes
> on a daily basis and have an interest in CPU isolation features.

See above.
Following issues were raised:
1. Peter and Steven initially thought that workqueue isolation is not needed.
2. Paul thought that it should be implemented on top of cpusets.
3. Peter thought that stopmachine change is not safe.
There were a couple of other minor misunderstandings (for example Peter thought
that I'm completely disallowing IRQs on isolated CPUs, which is obviously not
the case). I clarified all of them.

#1 I explained in the original thread and then followed up with concrete code example
of why it is needed.
http://marc.info/?l=linux-kernel&m=120217173001671&w=2
Got no replies so far. So I'm assuming folks are happy.

#2 I started a separate thread on that
http://marc.info/?l=linux-kernel&m=120180692331461&w=2
The conclusion was, well let me just quote exactly what Paul had said:
----


> Paul Jackson wrote:
>> Max wrote:
>>> Looks like I failed to explain what I'm trying to achieve. So let me try again.
>>
>> Well done. I read through that, expecting to disagree or at least
>> to not understand at some point, and got all the way through nodding
>> my head in agreement. Good.
>>
>> Whether the earlier confusions were lack of clarity in the presentation,
>> or lack of competence in my brain ... well guess I don't want to ask that
>> question ;).

----

And #3: Peter did not agree with me but said that it's up to Linus or Andrew to decide
whether it's appropriate in mainline or not. I _clearly_ indicated that this part is
somewhat controversial and may be dangerous; I'm _not_ trying to sneak something in.
Andrew picked it up and I'm going to do some more investigation on whether it's really
unsafe or is actually fine (about to send an email to Rusty).

> Generally i think that cpusets is actually the feature and API that
> should be used (and extended) for CPU isolation - and we already
> extended it recently in the direction of CPU isolation. Most enterprise
> distros have cpusets enabled so it's in use. Also, cpusets has the
> appeal of being commonly used in the "big honking boxes" arena, so
> reusing the same concept for RT and virtualization stuff would be the
> natural approach. It already ties in to the scheduler domains code
> dynamically and is flexible and scalable. I resisted ad-hoc CPU
> isolation patches in -rt for that reason.

That's exactly what Paul proposed initially. I completely disagree with that, but I did look
at it in _detail_.
Please take a look here for a detailed explanation:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2
This email is getting too long and I did not want to inline everything.

> Also, i'd not mind some test-coverage in sched.git as well.

I believe it has _nothing_ to do with the "scheduler" but I do not mind it being in that tree.
Please read this email on why it has nothing to do with the scheduler
http://marc.info/?l=linux-kernel&m=120210515323578&w=2
That's the email that convinced Paul.

To sum it up: it has been discussed with the right people. I do not believe that the pull
request was premature. In fact I think we're making a bigger deal out of these simple
changes than we should. At the end of the day those features are disabled by default
and do _not_ affect _anything_. But like I said, I'll play by the rules. So ...

The next step for me is to address Andrew's comments; I'll resend the patches with those fixes,
and follow up with Andrew and Rusty on the stopmachine thing.

Thanks for your reply.
Max
