When I call tasklet_schedule() to do deferred processing, the code appears
to queue the tasklet to the current processor. So typically, when a CPU
takes an interrupt and the ISR queues a tasklet, it gets queued to the
same processor that took the interrupt.
Assume I am running many protocols on a single piece of hardware over the
same physical fibre link, say networking and storage. Most Virtual
Interface hardware, and the new IB technologies, allow doing this.
The way interrupt processing is done is a little different, in the sense
that the interrupts are ganged: e.g. a single interrupt could report that 2
different drivers have completions to process. Both of those can be
processed in parallel, since they share no data.
In an MP case, we would like 2 separate processors to take the completion
processing. But tasklets don't seem to suit this, since tasklet_schedule()
basically queues on the CPU that is currently running, which means both get
queued to the same tasklet_vec[cpu]. But I want each to run on a separate
CPU. Is using a softirq the right method? Or could I have CPU affinity for
tasklets? (I know there is affinity for interrupts, but I am not aware of
anything similar for tasklets.)
Any help is greatly appreciated.
ashokraj
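For reference, the pattern in question looks roughly like this (the handler
and tasklet names here are made up for illustration); in the 2.4-era code,
tasklet_schedule() links the tasklet onto the tasklet_vec[] of the CPU it
is called from:

```c
/* Illustrative 2.4-style sketch, not from a real driver: a tasklet
 * scheduled from an ISR is queued on the CPU that took the interrupt. */
static void my_completion_work(unsigned long data)
{
        /* deferred completion processing runs here, on the same CPU
         * that ran the ISR below */
}

DECLARE_TASKLET(my_tasklet, my_completion_work, 0);

static void my_isr(int irq, void *dev_id, struct pt_regs *regs)
{
        /* acknowledge the device ... */
        tasklet_schedule(&my_tasklet); /* goes to this CPU's tasklet_vec[] */
}
```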
> In an MP case, we would like 2 separate processors to take the
> completion processing. But tasklets don't seem to suit this, since
> tasklet_schedule() basically queues on the CPU that is currently
> running, which means both get queued to the same tasklet_vec[cpu]. But
> I want each to run on a separate CPU. Is using a softirq the right
> method? Or could I have CPU affinity for tasklets? (I know there is
> affinity for interrupts, but I am not aware of anything similar for
> tasklets.)
you'll get a natural affinity of tasklets: they will run on the processor
where the tasklet got activated. Tasklets are just a special form of
softirqs, they have no context in the classic task sense, the only
difference they have to softirqs is that the tasklet code guarantees
single-threadedness of the function executed.
if you are going to rely on tasklets for good SMP scalability then i'd
suggest using a separate tasklet for every device IRQ. Then bind each
hardirq to a particular CPU - thus both the hardirq and the softirq/tasklet
will run on the same processor.
Ingo
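A rough sketch of this scheme, assuming two devices with separate IRQs (all
names here are hypothetical): one tasklet per device, each scheduled from
its own hardirq, with the hardirqs pinned to different CPUs from user space
via /proc/irq/<n>/smp_affinity.

```c
/* Sketch: one tasklet per device IRQ (2.4-style, names hypothetical).
 * Each tasklet runs on whichever CPU services that device's hardirq. */
static void net_rx_work(unsigned long data)
{
        /* drain network completions */
}

static void scsi_done_work(unsigned long data)
{
        /* drain SCSI completions */
}

DECLARE_TASKLET(net_rx_tasklet, net_rx_work, 0);
DECLARE_TASKLET(scsi_done_tasklet, scsi_done_work, 0);

static void net_isr(int irq, void *dev_id, struct pt_regs *regs)
{
        tasklet_schedule(&net_rx_tasklet);   /* runs on this IRQ's CPU */
}

static void scsi_isr(int irq, void *dev_id, struct pt_regs *regs)
{
        tasklet_schedule(&scsi_done_tasklet);
}
```

With the IRQs then pinned from user space (the masks are hex CPU bitmaps,
and IRQ numbers 20/21 are made up):

    echo 1 > /proc/irq/20/smp_affinity   # net IRQ  -> CPU 0
    echo 2 > /proc/irq/21/smp_affinity   # scsi IRQ -> CPU 1

both completion paths run concurrently, each on its own CPU.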
The natural affinity of tasklet execution is really the one I am trying to
get away from.
I.e., in our devices a single interrupt from the device indicates several
virtual device interrupts, so even if I have a separate tasklet for each
virtual device interrupt, the code that services the real interrupt and
schedules the tasklets will end up queueing all of them on a single CPU.
E.g., if 3 virtual device interrupts happen and they are all indicated by a
single real interrupt from the device, all 3 tasklets would be queued to
the same CPU:
cpu 0
-----
intr()
    queue tasklet_1
    queue tasklet_2
    queue tasklet_3
Since tasklets 1, 2 and 3 service totally independent virtual interrupts,
we would like to kick 1, 2 and 3 onto different CPUs' queues. Even better
would be some load balancing, so that tasklet code running on separate CPUs
could pick up more work when it is done processing its current work.
Why do you care? Unless your interrupt event handling code is seriously
slow, surely you want to run things serially, efficiently, and while the
cache is hot?
#2: You got it right. The hardware is designed to generate fewer
interrupts, since the necessary information is available by other means,
and there is a lot of time saved by not taking the interrupt.
I will give #3 a try and let you folks know.
ashokr
-----Original Message-----
From: mi...@localhost.localdomain [mailto:mi...@localhost.localdomain]On
Behalf Of Ingo Molnar
Sent: Saturday, December 22, 2001 2:35 PM
To: Ashok Raj
Cc: linux-...@vger.kernel.org
Subject: RE: affinity and tasklets...
On Sat, 22 Dec 2001, Ashok Raj wrote:
> The natural affinity of tasklet execution is really the one I am trying
> to get away from.
some form of interrupt source is needed to load-balance IRQ load to other
CPUs - some other, unrelated processor won't just start executing the
necessary function without that CPU getting interrupted in some way.
(polling is an option too, but that's out of the question for a generic
solution.)
there are a number of solutions to this problem.
0) is it truly necessary to process the 3 virtual devices in parallel? Are
they independent, and is the processing heavy enough that it demands
distribution between CPUs?
1) the hardware could generate real IRQs for the virtual devices too,
which would get load-balanced automatically. I suspect this is not an
option in your case, right?
2) the 'hard' IRQ you generate could be broadcast to multiple CPUs at
once. Your IRQ handler would have the target CPU number hardcoded. This is
pretty inflexible and needs some lowlevel APIC code changes.
3) upon receiving the hard-IRQ, you could also trigger execution on other
CPUs, via smp_call_function().
i think #3 is the most generic solution. You'll have to do the
load-balancing by determining the target CPU of smp_call_function().
Ingo
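A minimal sketch of option #3 using the 2.4-era smp_call_function()
(signature: func, info, retry, wait), which runs the function on all
*other* CPUs, so the callee filters on a chosen target CPU; all the
driver-side names below are hypothetical. Note that mainline documents
smp_call_function() as unsafe to call from hardirq context, so a real
implementation would have to defer this call, e.g. into a softirq:

```c
/* Sketch: push one virtual device's completion work to another CPU. */
static int target_cpu;

static void run_ipc_completions(void *info)
{
        if (smp_processor_id() != target_cpu)
                return;                 /* not the chosen CPU */
        do_ipc_completions(info);       /* hypothetical driver work */
}

static void fan_out_completions(void *dev_id)
{
        /* process network completions locally, cache-hot */
        do_network_completions(dev_id); /* hypothetical */

        /* trivial round-robin target choice; real balancing is up to us */
        target_cpu = (smp_processor_id() + 1) % smp_num_cpus;

        /* kick the IPC work onto the other CPU; don't wait (wait=0) */
        smp_call_function(run_ipc_completions, dev_id, 0, 0);
}
```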
> I.e., in our devices a single interrupt from the device indicates several
> virtual device interrupts, so even if I have a separate tasklet for each
> virtual device interrupt, the code that services the real interrupt and
> schedules the tasklets will end up queueing all of them on a single CPU.
> > Why do you care? Unless your interrupt event handling code is seriously
> > slow, surely you want to run things serially, efficiently, and while
> > the cache is hot?
This is based on our observation with existing hardware: when we run
several protocols through this single device (storage SCSI traffic, LAN,
IPC with lots of short and very large messages), we see just one CPU
pegged, even on an 8-way system. All the handling is totally independent;
storage traffic has no need to be serialized with LAN traffic, and vice
versa.
These are the reasons we are looking beyond what is available:
1. More protocols on a single fibre via the same device.
2. The device does not stop transferring data when interrupts are
generated, because it puts additional interrupt conditions in the event
queue and generates an interrupt only when all existing events have been
processed. So assume the following case:
- Network completion happens (signalled in the event queue; the real
  interrupt gets asserted)
- IPC completion happens (signalled in the event queue)
- The real interrupt gets serviced
- Now we queue a tasklet for the network, then one for the IPC
- Say the network has about 100 receives ready to process. Because they
  are ahead of the IPC tasklet, IPC processing is held up completely until
  the network processing finishes, while the system has 7 more CPUs doing
  nothing.
ashokr
> #2: You got it right. The hardware is designed to generate fewer
> interrupts, since the necessary information is available by other
> means, and there is a lot of time saved by not taking the interrupt.
point is, there is no time saved by not taking the interrupt. In fact it's
slightly more expensive to use smp_call_function() instead of getting the
proper hardware interrupts. (because there is some setup cost of inter-CPU
APIC interrupts, and also you have to load-balance manually.)
the IRQ latency itself does not show up as lost CPU time on modern IRQ
delivery systems - while the IRQ latency is around 5-10 microseconds, the
true 'null interrupt cost' is only around 1-3 microseconds on 8-way
systems. And by generating cross-CPU interrupts for smp_call_function()
there are almost no savings anyway - they are normal interrupts and have
similar IRQ entry overhead as hardware-IRQs.
so it's almost always a better idea to use multiple interrupts if there
are multiple, more or less orthogonal devices. There might be cases where
keeping a single IRQ source is the best solution (e.g. hw simplicity, or
conserving IRQ space) - but almost never for performance reasons;
software-driven IRQ distribution is not going to be more efficient than a
hardware-based one.