
Processor Affinity in the age of unbounded cores


MitchAlsup

Aug 23, 2021, 2:24:05 PM
I have been giving some thought to processor affinity. But I want to
look at the future where there may be an essentially unbounded
number of cores.
<
The current model of 1 bit to denote "any core" and then a bit vector
to say "any of these cores can run this task/thread" falls apart when
the number of cores grows beyond 32 (or 64, or 128).
The bit vector approach "does not scale".
<
So, I ask: what would a programming model for assigning processor
affinity look like when the number of cores is not easily mapped onto
common register widths ?
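
For reference, the bit-vector model in question is roughly what Linux
exposes today through sched_setaffinity(); a minimal, Linux-specific
sketch (glibc also has the CPU_ALLOC family for wider masks, but the
interface is still one bitmap per task):

/* The familiar fixed-width mask, roughly as Linux exposes it via
   sched_setaffinity(). Sketch only; error handling omitted. */
#define _GNU_SOURCE
#include <sched.h>

int pin_to_core(int core)
{
    cpu_set_t set;               /* one bit per CPU                        */
    CPU_ZERO(&set);
    CPU_SET(core, &set);         /* "this thread may run only on 'core'"   */
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling thread */
}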

Stephen Fuld

Aug 23, 2021, 3:01:37 PM
Interesting question. I think we should take a step back and determine
what problem processor affinity is needed/desirable to solve. I can
think of four, though there are probably others.

1. A process needs to run on one particular core, say to run some
kind of diagnostic.

2. Not all cores are equivalent. Some could be faster or slower, or
some may have particular extensions that others don't have.

3. Some cores share some level of cache with some, but not all, other
cores. So you may want a process to be in the same group of cores as
some other process, or in a different group, depending upon their cache
usage patterns.

4. There may be some other characteristic of the processes that is
relevant, such as one process producing an intermediate result used by
another process, so there is no use in having the second process bid
for another core and thus potentially use resources needlessly. I am
somewhat fuzzy about this.

It may be that there are different solutions for some of these. For
example, for the first, the program could use a "core number" rather
than a bit mask. For the second, you could have the OS define "groups",
each with particular characteristics. Then a process would request a
"group number", and be assigned to any processor in that group.

For three and four, the software could specify "same core as process
X", or "different core from process Y". This gets the application
software out of specifying physical core sets.

Obviously, this is pretty much off the top of my head, and needs a lot
of "refinement". :-)



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

EricP

Aug 23, 2021, 3:23:16 PM
Physical memory is allocated to minimize the distance between a
thread's preferred core, and that cores memory or neighbor memory.
The bit vector model does not take NUMA distance to memory into account.

A thread would be allocated NUMA memory as close as possible,
and would continue to use that memory until it is recycled.

Having two or more threads with shared process address space running on
different cores raises the question which NUMA node should the memory
be allocated on? Presumably each thread's stack would allocate on
the current core.

Moving thread T1 from core C0 to core C1 might have a shared L2,
to core C3 have 1 hop back to the memory previously allocated on C0,
to core C4 have 2 hops back to previous memory allocated on C0.
This extra hop cost continues until physical pages are recycled.

Moving a thread between cores also usually does not take into account
the loss of investment in the cache, and the delay to re-charge the cache.
Just guessing, but capacitor charge-discharge model probably applies.

This suggests a hierarchy of costs to move up, across, and then down
between cores for load balancing to take into account.
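
One way to picture that hierarchy is as a table of relative costs keyed
by how far the move goes; a rough sketch, with made-up weights, just to
show the shape of the data a load balancer would consult:

/* A sketch of the hierarchical migration cost described above.
   All names and weights are illustrative, not from any real scheduler. */
enum move_kind { SAME_CORE, SHARED_L2, SAME_NODE, ONE_HOP, TWO_HOPS };

struct move_cost {
    int cache_refill;   /* lost cache/TLB investment to re-charge        */
    int extra_latency;  /* per-access penalty to memory left behind      */
};

static const struct move_cost cost_table[] = {
    [SAME_CORE] = {  0, 0 },
    [SHARED_L2] = { 10, 0 },   /* refill L1 only, memory still local     */
    [SAME_NODE] = { 50, 0 },   /* refill L1+L2, memory still local       */
    [ONE_HOP]   = { 50, 1 },   /* plus 1 hop back to old NUMA memory     */
    [TWO_HOPS]  = { 50, 2 },   /* penalty persists until pages recycle   */
};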



MitchAlsup

Aug 23, 2021, 3:58:09 PM
On Monday, August 23, 2021 at 2:23:16 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > I have been giving some thought to processor affinity. But I want to
> > look at the future where there may be an essentially unbounded
> > number of cores.
> > <
> > The current model of 1-bit to denote "any core" and then a bit vector
> > to say "any of these cores can run this task/thread" falls apart when
> > number of cores is bigger than 32-cores (or 64-cores or 128-cores).
> > The bit vector approach "does not scale".
> > <
> > So, I ask: what would a programming model for assigning processor
> > affinity when the number of cores is not easily mapped into common
> > register widths ?
<
> Physical memory is allocated to minimize the distance between a
> thread's preferred core, and that cores memory or neighbor memory.
> The bit vector model does not take NUMA distance to memory into account.
>
> A thread would be allocated NUMA memory as close as possible,
> and would continue to use that memory until it is recycled.
<
So, would an I/O driver (SATA) want to be closer to his memory or closer
to his I/O (Host Bridge) ?
<
Would an interrupt from the Host Bridge be directed to a core in this "Node" ?
So, could the device driver be situated close to its memory but its interrupt
handler closer to the device ?
<
Would a device page fault be directed at the interrupt handler or at the I/O
page fault handling driver ?
>
> Having two or more threads with shared process address space running on
> different cores raises the question which NUMA node should the memory
> be allocated on? Presumably each thread's stack would allocate on
> the current core.
>
> Moving thread T1 from core C0 to core C1 might have a shared L2,
> to core C3 have 1 hop back to the memory previously allocated on C0,
> to core C4 have 2 hops back to previous memory allocated on C0.
> This extra hop cost continues until physical pages are recycled.
>
> Moving a thread between cores also usually does not take into account
> the loss of investment in the cache, and the delay to re-charge the cache.
> Just guessing, but capacitor charge-discharge model probably applies.
>
> This suggests a hierarchy of costs to move up, across, and then down
> between cores for load balancing to take into account.
<
Thanks for this observation.
<
Now what about multi-threaded cores? The cache cost of migrating around an
MT core is low, but the throughput could be lower due to both (or many) threads
sharing execution resources.

MitchAlsup

Aug 23, 2021, 4:03:18 PM
On Monday, August 23, 2021 at 2:01:37 PM UTC-5, Stephen Fuld wrote:
> On 8/23/2021 11:24 AM, MitchAlsup wrote:
> > I have been giving some thought to processor affinity. But I want to
> > look at the future where there may be an essentially unbounded
> > number of cores.
> > <
> > The current model of 1-bit to denote "any core" and then a bit vector
> > to say "any of these cores can run this task/thread" falls apart when
> > number of cores is bigger than 32-cores (or 64-cores or 128-cores).
> > The bit vector approach "does not scale".
> > <
> > So, I ask: what would a programming model for assigning processor
> > affinity when the number of cores is not easily mapped into common
> > register widths ?
<
> Interesting question. I think we should take a step back and determine
> what problem processor affinity is needed/desirable to solve. I can
> think of four, though there are probably others.
>
> 1. A process needs to run on a particular specific core, say to run some
> kind of diagnostic.
<
Is pinning merely a subset of affinity ?
>
> 2. All the cores are not equivalent. Some could be faster or slower, or
> some may have particular extensions that other don't have.
<
This raises the question of powering various cores back on, and the
latency to do just that!
>
> 3. Some cores share some level of cache with some, but not all other
> cores. So you may want a process to be in the same group or a different
> group of cores from some other process, depending upon their cache usage
> pattern.
<
This is similar to EricP's hierarchy observation.
>
> 4. There may be some other characteristic of the processes that is
> relevant, such as one process produces an intermediate result used by
> another process, so there is no use having the second process bidding
> for another core and thus potentially using resources needlessly. I am
> somewhat fuzzy about this.
<
And all sorts of transitivity issues between threads/tasks.
>
> It may be that there are different solutions for some of these. For
> example, for the first, the program could use a "core number" rather
> than a bit mask. For the second, you could have the OS define "groups"
> each with particular characteristics. The a process would request a
> "group number", and be assigned to any processor in that group.
<
This gets at the gist of my question:: what is the right model to
give users to control their machines, which are vastly more core
intensive than those we are using today ?
>
> For three and four, the software could specify "same core as process
> "X", or "different core from process "Y". This get the applications
> software out of specifying physical core sets.
<
But how does a user express this ? And how does an OS/HV do this ?
>
> Obviously, this is pretty much off the top of my head, and needs a lot
> of "refinement". :-)
<
Thanks for a go at it.

Stephen Fuld

Aug 23, 2021, 6:22:06 PM
On 8/23/2021 1:03 PM, MitchAlsup wrote:
> On Monday, August 23, 2021 at 2:01:37 PM UTC-5, Stephen Fuld wrote:
>> On 8/23/2021 11:24 AM, MitchAlsup wrote:
>>> I have been giving some thought to processor affinity. But I want to
>>> look at the future where there may be an essentially unbounded
>>> number of cores.
>>> <
>>> The current model of 1-bit to denote "any core" and then a bit vector
>>> to say "any of these cores can run this task/thread" falls apart when
>>> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
>>> The bit vector approach "does not scale".
>>> <
>>> So, I ask: what would a programming model for assigning processor
>>> affinity when the number of cores is not easily mapped into common
>>> register widths ?
> <
>> Interesting question. I think we should take a step back and determine
>> what problem processor affinity is needed/desirable to solve. I can
>> think of four, though there are probably others.
>>
>> 1. A process needs to run on a particular specific core, say to run some
>> kind of diagnostic.
> <
> Is pinning merely a subset of affinity ?

No. I am thinking of a "group" concept. See below. I suppose you
could have the OS define each processor as a single member group. That
might make sense.



>>
>> 2. All the cores are not equivalent. Some could be faster or slower, or
>> some may have particular extensions that other don't have.
> <
> This begs the question of powering back on various cores and the
> latency to do just such!

The OS would predefine each processor type as a "group", e.g. GBOoO
processors, small in-order processors, processors with a specific extra
functionality, etc. These would be well known to the application. Then
the application would do some sort of syscall to specify to which group
it wants affinity.



>>
>> 3. Some cores share some level of cache with some, but not all other
>> cores. So you may want a process to be in the same group or a different
>> group of cores from some other process, depending upon their cache usage
>> pattern.
> <
> This is similar to EricP's hierarchy observation.
>>
>> 4. There may be some other characteristic of the processes that is
>> relevant, such as one process produces an intermediate result used by
>> another process, so there is no use having the second process bidding
>> for another core and thus potentially using resources needlessly. I am
>> somewhat fuzzy about this.
> <
> And all sorts of transitivity issues between threads/tasks.

Here, I would allow user processes to ask the OS to create a new group.
Then other processes could ask the OS to join that group. Processes
in the same group would share an affinity. Alternatively, a process
could ask to be scheduled "away" from a particular group.

The idea is that individual processes don't need your bitmap of cores,
just a group name. The OS maps group membership to specific cores as
needed.
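
As a sketch of what that could look like to the application, with
entirely hypothetical call names (no OS provides these today):

/* A hypothetical system-call surface for the "group" model above.
   None of these calls exist anywhere; the names are illustrative. */
typedef long group_t;

group_t affinity_group_create(const char *name);   /* new user-defined group    */
int     affinity_group_join(group_t g);            /* "schedule me with these"  */
int     affinity_group_avoid(group_t g);           /* "schedule me away from"   */
group_t affinity_group_lookup(const char *name);   /* e.g. "GBOoO", "in-order"  */

/* The caller never sees a core bitmap; the OS maps group membership
   onto whatever physical cores it currently considers appropriate. */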

JimBrakefield

Aug 23, 2021, 6:25:42 PM
Here's my go at it (explained as simply as possible):
Intrinsically going for a bottom-up approach, start with the communications
network. Store and forward (SNF) at one clock per step. For a modest
network of cores, a squashed torus is used: take a fat torus, squash it, and fold
it in half in both the X and Y directions. The packet size is on the order of 256
bits or so. For a large network, a 3D torus is needed: take the 3rd dimension
and squash the cores into a small square. Say 4096 cores total, so 16 cores
is a small square. The SNF network needs to hop across the 16 cores in a single
clock cycle. Core numbering is such that 3D blocks of any size or aspect ratio
can be specified. A packet can be sent to any size of 3D block. This is a
mechanism similar to an OoO machine functional unit broadcasting its result.

Ok, so how are cores allocated? A core requests a block of cores. To minimize
store-and-forward hops the block should be as cubic as possible and as
near as possible. Perhaps this is something the SNF network can perform?
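
For what it's worth, the metric such an allocator would minimize is just
torus hop distance; a small sketch, with the dimensions and the
wrap-around links as assumptions rather than the exact design:

/* Sketch: hop distance between two cores on an X*Y*Z torus, the metric
   an allocator would minimize when picking a near-cubic block. */
static int wrap_dist(int a, int b, int size)
{
    int d = a > b ? a - b : b - a;
    return d < size - d ? d : size - d;   /* torus: go the short way around */
}

int torus_hops(int x0, int y0, int z0, int x1, int y1, int z1,
               int X, int Y, int Z)
{
    return wrap_dist(x0, x1, X) + wrap_dist(y0, y1, Y) + wrap_dist(z0, z1, Z);
}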

These ideas resulted from a study for a NN machine for spiking neurons using
FPGA sticks for each node.

Terje Mathisen

Aug 24, 2021, 3:54:08 AM
Same answer as for network routing tables and huge core count clusters?

I.e. directories so that only the interested cores need to bother. This
works well as long as you don't have many threads that you want to allow
to run "on any core except the first and last". I.e. you need either
"any core" as a single bit, or "small_list_of_cores".

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

pec...@gmail.com

Aug 24, 2021, 6:41:23 AM
On Monday, August 23, 2021 at 8:24:05 PM UTC+2, MitchAlsup wrote:
> I have been giving some thought to processor affinity. But I want to
> look at the future where there may be an essentially unbounded
> number of cores.
> <
> The current model of 1-bit to denote "any core" and then a bit vector
> to say "any of these cores can run this task/thread" falls apart when
> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
> The bit vector approach "does not scale".
It scales perfectly (linearly) with the number of cores as long as strict affinity makes sense.
> So, I ask: what would a programming model for assigning processor
> affinity when the number of cores is not easily mapped into common
> register widths ?
A small memory area is the solution; a cache line will be sufficient for many years.
In practice, processor allocation will be similar to memory allocation: you rarely ask for specific physical bytes, you just ask for compact areas.
Short- to medium-scale connectivity should have redundant capacity that can absorb imperfect allocations.
A virtualization layer should allow for dynamic core allocation, core migration, etc. It is a complex optimization problem.

In a large system the topology defined by core connectivity will be important.
On a small scale the topology will be tree-like, with shared memory and address space.
On larger scales - a 3-dimensional connection mesh; the topology of physical space will win.
Physical realisations of a 3D topology can have a fractal-like tree structure of connections with shortcuts and "routers".
At the largest scale, physics forces us to use the 2-dimensional surface of a celestial body, then a Dyson sphere ;)

EricP

Aug 24, 2021, 1:22:57 PM
MitchAlsup wrote:
> On Monday, August 23, 2021 at 2:23:16 PM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> I have been giving some thought to processor affinity. But I want to
>>> look at the future where there may be an essentially unbounded
>>> number of cores.
>>> <
>>> The current model of 1-bit to denote "any core" and then a bit vector
>>> to say "any of these cores can run this task/thread" falls apart when
>>> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
>>> The bit vector approach "does not scale".
>>> <
>>> So, I ask: what would a programming model for assigning processor
>>> affinity when the number of cores is not easily mapped into common
>>> register widths ?
> <
>> Physical memory is allocated to minimize the distance between a
>> thread's preferred core, and that cores memory or neighbor memory.
>> The bit vector model does not take NUMA distance to memory into account.
>>
>> A thread would be allocated NUMA memory as close as possible,
>> and would continue to use that memory until it is recycled.
> <
> So, would an I/O driver (SATA) want to be closer to his memory or closer
> to his I/O (Host Bridge) ?
> <
> Would interrupt from the Host Bridge be directed to a core in this "Node" ?
> So, could the device driver be situated close to its memory but its interrupt
> handler closer to the device ?

I started to answer but the answer turned into a book. I'll try again.

Here we are talking specifically about HW threads servicing interrupts.
These would not be like normal application threads as they are part of
the OS kernel and are working on behalf of a cpu.
Prioritized HW interrupt threads are equivalent to prioritized
interrupt service routines (ISR).

If I were doing this, there would be one HW thread for each core that can
service interrupts, for each HW priority interrupt request level (IRQL).
Each interrupt thread has a thread header plus a small (12-16 kB) stack.
These threads are created at OS boot, each has a hard affinity with
one core and the pages should be allocated from that core's home memory.

The Priority Interrupt Controller (PIC), or what x64 calls APIC,
controls to which core interrupts are routed. If any interrupt can be
routed to any core, and if interrupt priorities are 1 (low) to 7 (high),
and if we have 64 cores, then there are 448 HW threads dedicated
to servicing interrupts (which is why we don't do this).
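
The arithmetic, using the numbers above:

/* Back-of-envelope sizing for the scheme above, using the numbers
   given (64 cores, 7 IRQLs, 16 kB stacks). */
enum { NCORES = 64, NIRQL = 7, STACK_BYTES = 16 * 1024 };

enum { N_INT_THREADS = NCORES * NIRQL };              /* 448 interrupt threads   */
enum { STACK_TOTAL   = N_INT_THREADS * STACK_BYTES }; /* 7 MiB of pinned stacks  */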

The SMP interrupt service threads use per-device, per-interrupt spinlocks
to serialize their execution of each device's ISR.

To answer your question, we could program the PIC to route the
interrupts to the physically closest core, and that core's
interrupt service thread would use that core's memory for its stack.

If we know a particular device driver will always be serviced by a
single core then when the driver is loaded or the devices are mounted,
its code and data should also be allocated from that core's home memory.

In reality there is no need for full SMP servicing of interrupts
on any core. Alpha VMS handled all interrupts on core-0.
There are likely only a couple of cores that are near a bridge,
so in reality maybe we only need 2 cores with 7 interrupt threads each.

Also each core has a HW thread for executing at its
software interrupt level, executing things like what WNT calls
Deferred Procedure Call (DPC) callback work packets,
which can be posted by, amongst other things, ISRs to continue
processing an interrupt at a lower priority level.

Each core needs an idle thread to execute when it has no other work.
It does various housekeeping like pre-zeroing physical memory pages
for the free list.

> <
> Would a device page fault be directed at the interrupt handler or at I/O
> page fault handling driver ?

OS kernel code and data falls into two categories, pagable and non-pagable.
Some OS only allow non-pagable kernel memory.
If a page fault occurs in a non-pagable memory section, the OS crashes.
In any OS that I am familiar with, all interrupt code and data are
non-pagable in order to avoid all sorts of nasty race conditions.

When a device driver is loaded, the memory sections specified by its exe
are allocated and assigned physical pages at create time,
the pages are pinned so they never get swapped out,
and the code or data is copied from the exe file at driver load.
Similarly, as each device of that driver is mounted,
any local memory for that device is allocated from the non-paged heap.

As I said above, if we know a certain driver or device will always
be serviced by a single core, OS should ensure that only local
physical memory is allocated at driver load and device mount.

>> Having two or more threads with shared process address space running on
>> different cores raises the question which NUMA node should the memory
>> be allocated on? Presumably each thread's stack would allocate on
>> the current core.
>>
>> Moving thread T1 from core C0 to core C1 might have a shared L2,
>> to core C3 have 1 hop back to the memory previously allocated on C0,
>> to core C4 have 2 hops back to previous memory allocated on C0.
>> This extra hop cost continues until physical pages are recycled.
>>
>> Moving a thread between cores also usually does not take into account
>> the loss of investment in the cache, and the delay to re-charge the cache.
>> Just guessing, but capacitor charge-discharge model probably applies.
>>
>> This suggests a hierarchy of costs to move up, across, and then down
>> between cores for load balancing to take into account.
> <
> Thanks for this observation.
> <
> Now what about multi-threaded cores, the cache cost of migrating around a
> MT core is low, but the throughput could be lower due to both (or many) cores
> sharing execution resources?

If an application, say a database server, creates multiple threads,
one per core and binds it to that core, then the physical memory for
the stack of each thread should come from the local home memory and
that would be optimal for that thread.

But there are still shared code and data pages. Whichever home the physical
memory for code and data is allocated from, it will be non-optimal for
all but one thread. There is no way around this.

If threads move from core to core to balance load, the OS scheduler needs
to take into account that this move is not free by having some hysteresis
in the algorithm to make things a bit sticky.
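
A toy version of that hysteresis test, with the margin purely
illustrative:

/* Sketch of a "sticky" migration test: only move a thread if the
   imbalance benefit clearly exceeds the cache/NUMA repopulation cost.
   The 2x margin is an arbitrary illustration. */
struct candidate { int benefit; int migrate_cost; };

static int should_migrate(const struct candidate *c)
{
    /* Demand a margin, not just benefit > cost, so threads don't
       ping-pong between cores on small load fluctuations. */
    return c->benefit > c->migrate_cost * 2;
}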


MitchAlsup

Aug 24, 2021, 3:01:44 PM
LoL
<
>
> Here we are talking specifically about HW threads servicing interrupts.
> These would not be like normal application threads as they are part of
> the OS kernel and are working on behalf of a cpu.
> Prioritized HW interrupt threads are equivalent to prioritized
> interrupt service routines (ISR).
>
> If I was doing this there would be one HW thread for each core that can
> service interrupts, for each HW priority interrupt request level (IRQL).
> Each interrupt thread has a thread header plus a small (12-16 kB) stack.
> These threads are created at OS boot, each has a hard affinity with
> one core and the pages should be allocated from that core's home memory.
<
Ok, but let me throw a monkey wrench into this::
<
Say we have an environment supporting virtual devices. That is, there is
a table in memory that maps the virtual device into a physical device
so that the hypervisor can allow the hosted OS to perform its own I/O
to something like a SATA drive. The virtualized SATA driver gets a
virtual Bus:Device,Function, and his page map contains a page in
memory mapped I/O space pointing at virtual device MMIO control
registers. The virtual device driver, running down near user priority level
does 5-10 stores to the virtual device control registers to initiate a read or
write to the disk. The disk schedules the access as it chooses, and later on
when DMA is ready, the device table is used to associate virtual
Bus:Device,Function with physical Bus:Device,Function and also the
virtual machine and mapping tables of the OS this DMA request should
use.
<
>
> The Priority Interrupt Controller (PIC), or what x64 calls APIC,
> controls to which core interrupts are routed. If any interrupt can be
> routed to any core, and if interrupt priorities are 1 (low) to 7 (high),
> and if we have 64 cores, then there are 448 HW threads dedicated
> to servicing interrupts (which is why we don't do this).
<
My recent research on this shows the number of interrupt "levels"
increasing to 256-ish interrupts, each interrupt vector having its
own priority and a few related things. AMD virtualization via I/O-MMU.
<
So
<
DMA address gets translated though the OS mapping tables using
the requesting thread virtual address translated to host virtual address
which is translated by the virtual machine mapping tables to physical
address.
<
{Now this is where it gets interesting}
<
After DMA is complete, the device sends an interrupt to the device's interrupt
handler (which is IN the OS under the HV). The interrupt causes a control
transfer into the interrupt handler; the interrupt handler reads a few control
registers in memory mapped I/O space, builds a message for a lower
level Task handler, and exits. The interrupt handler ran at "rather" high priority,
task handler runs at medium priority. Task handler takes message
rummages through OS tables and figures out who needs (and what kind
of) notifications, and performs that work. Then Task handler exits (in one
way or another) to the system scheduler. System scheduler figures out who
to run next--which may have changed since the arrival of the interrupt.
<
All of the last several paragraphs are transpiring as other interrupts
are raised and various control transfers take place. {Sometimes in
rather bizarre orders due to timing; but it all works out.}
<
In this virtual device case:: the OS may not KNOW where its device actually
is {with paravirtualization it would} and so may not be able to properly
affinitize the handlers.
<
Anyway, I have figured out the vast majority of the device virtualization
requirements, and have been working on the kinds of HW services that
are required to make this stuff fast {Mostly just caching and careful
Table organization.}
>
> The SMP interrupt service threads use per-device, per-interrupt spinlocks
> to serialize their execution of each device's ISR.
<
I have figured out a way to perform this serialization without any
kind of locking......
>
> To answer your question, we could program the PIC to route the
> interrupts to the physically closest core, and that core's
> interrupt service thread would use that core's memory for its stack.
<
Yes, route interrupts from the virtual device to the virtual requestor
interrupt handler and perform control transfer {based on priority of
the interrupt and the priority of the core set it is allowed to run on}
>
> If we know a particular device driver will always be serviced by a
> single core then when the driver is loaded or the devices are mounted,
> its code and data should also be allocated from that core's home memory.
<
So somebody in the chain of command granting guest access to virtual
device does the affinity or pinning of the interrupt threads to core set.
<
Check.
<
I suspect for completeness, the 'task' handler is pinned to the same
core set.
<
Now imagine that the interrupt handler can transfer control to the
task handler without an excursion through the OS (or HV) process
schedulers.
<
And that when Task Handler is done, it can return to interrupted process
(letting it run) while OS (and HV) search through their scheduling tables
to figure out what is the next process to run. {Remember I am talking
about a machine with inherent parallelism.}
<
When the OS/HV decides that process[k] is the next thing to run on
core[j] there is a means to cause a remote core[j] to context switch
from what it was doing to process[k] in a single instruction. Control
is transferred and core[j] begins running process[k], and the OS/HV
thread continues on unabated. It spawned process[k] onto core[j]
without losing control of who it was or where it is.
>
> In reality there is no need for full SMP servicing of interrupts
> on any core. Alpha VMS handled all interrupts on core-0.
> There are likely only a couple of cores that are near a bridge,
> so in reality maybe we only need 2 cores with 7 interrupt threads each.
<
Simplifications are nice. But Alpha was from an era where there
were not that many nodes in a system. I wonder what the
best model is for the future when those assumptions no longer
hold.
>
> Also each core has a HW thread for executing at its
> software interrupt level, executing things like what WNT calls
> Deferred Procedure Calls (DPC) callback work packets
> which can be posted by, amongst other things, ISR's to continue
> processing an interrupt at a lower priority level.
>
> Each core needs an idle thread to execute when it has no other work.
> It does various housekeeping like pre-zeroing physical memory pages
> for the free list.
<
Unless you have a means to make it simply appear as if the page
were filled with zeroes. In which case you allocate the page upon
call to malloc, and as you touch every line it automagically gets filled
with zeros (without reading memory to fill the lines--like Mill).
<
But there always seems to be an idle process somewhere running
at a priority lower than everyone else.
<
Thanks for reminding me of this.
> > <
> > Would a device page fault be directed at the interrupt handler or at I/O
> > page fault handling driver ?
<
> OS kernel code and data falls into two categories, pagable and non-pagable.
> Some OS only allow non-pagable kernel memory.
> If a page fault occurs in a non-pagable memory section, the OS crashes.
<
Question: In an OS under an HV, does the HV have to know that an OS page
is pinned ? Or, since the OS is under the HV, the HV can page this out, the OS
takes a page fault into the HV, the HV services it and returns, so as far as the
OS "sees" the page is "resident". On the other hand, enough latency might have
transpired for the device to have failed before the interrupt handler could
service the interrupt. !?!?
<
> In any OS that I am familiar with, all interrupt code and data are
> non-pagable in order to avoid all sorts of nasty race conditions.
<
Check, and good to know (be reminded of)
>
> when a device driver is loaded, the memory sections specified by its exe
> are allocated and assigned physical pages at create time,
> the pages are pinned so they never get outswapped,
> and the code or data is copied from the exe file at driver load.
> Similarly, as each device of that driver is mounted,
> any local memory for that device is allocated from non-paged heap.
<
I suspect that when one plugs in a USB device this happens at that instant.
>
> As I said above, if we know a certain driver or device will always
> be serviced by a single core, OS should ensure that only local
> physical memory is allocated at driver load and device mount.
<
Check
<
> >> Having two or more threads with shared process address space running on
> >> different cores raises the question which NUMA node should the memory
> >> be allocated on? Presumably each thread's stack would allocate on
> >> the current core.
> >>
> >> Moving thread T1 from core C0 to core C1 might have a shared L2,
> >> to core C3 have 1 hop back to the memory previously allocated on C0,
> >> to core C4 have 2 hops back to previous memory allocated on C0.
> >> This extra hop cost continues until physical pages are recycled.
> >>
> >> Moving a thread between cores also usually does not take into account
> >> the loss of investment in the cache, and the delay to re-charge the cache.
> >> Just guessing, but capacitor charge-discharge model probably applies.
> >>
> >> This suggests a hierarchy of costs to move up, across, and then down
> >> between cores for load balancing to take into account.
> > <
> > Thanks for this observation.
> > <
> > Now what about multi-threaded cores, the cache cost of migrating around a
> > MT core is low, but the throughput could be lower due to both (or many) cores
> > sharing execution resources?
<
> If an application, say a database server, creates multiple threads,
> one per core and binds it to that core, then the physical memory for
> the stack of each thread should come from the local home memory and
> that would be optimal for that thread.
>
> But there is still shared code and data pages. Whichever home the physical
> memory for code and data is allocated from, it will be non-optimal for
> all but one thread. No way around this.
<
Yes, no way around the database actually occupying good chunks of
memory on every node on which there is memory.
>
> If threads move from core to core to balance load, the OS scheduler needs
> to take into account that this move is not free by having some hysteresis
> in the algorithm to make things a bit sticky.
<
My guess is that the database wants to move units of work from DBThread to
DBThread so each DBThread remains near its working memory. That is,
the work might migrate around, but the core-cache footprint remains
localized.
<
While the OS threads are managed as the OS sees fit, under
whatever affinity is supported in the OS.

EricP

Aug 25, 2021, 10:53:19 PM
You left out a few steps.

- mapping the virtual device ids (or device handle) to an actual device
- checking permissions for that operation on that device
- validate IO arguments (e.g. buffer address and range)
- checking IO size vs quotas, maybe break 1 big IO into many smaller
- translate virtual addresses to physical, choose number to pin
- if virtual page not Present then inswap
- pin physical page to prevent removal/recycling
- queue request to device
- when device available, queue for a free DMA channel
- when DMA channel available program actual device hardware
- raise interrupt priority to block device interrupts
- spinlock to prevent concurrent access to device interrupt data
- write device control registers

On IO completion interrupt undo all the things we did above,
releasing resources and possibly allowing other queued IO to continue.

Maybe loop over all of this if the IO was larger than the resources that
can be allocated at once to a single requester.

There is also asynchronous IO cancellation to deal with,
both when IO is queued and, depending on the device, possibly while running.

> <
>> The Priority Interrupt Controller (PIC), or what x64 calls APIC,
>> controls to which core interrupts are routed. If any interrupt can be
>> routed to any core, and if interrupt priorities are 1 (low) to 7 (high),
>> and if we have 64 cores, then there are 448 HW threads dedicated
>> to servicing interrupts (which is why we don't do this).
> <
> My recent research on this shows the number of interrupt "levels"
> increasing to 256-ish interrupts, each interrupt vector having its
> own priority and a few related things. AMD virtualization via I/O-MMU.

There are two issues: one is the number of interrupt priority levels,
and one is the ability to determine directly which device is interrupting.

I have heard of 255 levels on embedded systems where designs have a fixed,
well-known set of devices to deal with; they like to use one interrupt level
per device interrupt. Because the OS is often doing hard real-time,
having a unique priority for each device source fits nicely into
the deadline scheduling model.

For general purpose computers, a large number of priority interrupt
levels offers nothing.
However being able to directly determine which device id requested
an interrupt saves polling all the devices on a daisy chain.
One could have a small number of priorities, say 7 or 9,
but also something like PCI's Message Queued Interrupts to put
the requesting device's ID number into the queue for that priority
connected to a huge number of devices.
If a request queue is not empty then a hardware interrupt is
requested at the queue's priority.
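
A sketch of such a per-priority message queue (sizes and names are
illustrative, not the actual PCI MSI/MSI-X mechanism):

/* One small ring per priority level holding requesting-device IDs,
   with a pending flag the CPU sees as a prioritized interrupt request. */
#define NPRIO  8
#define QDEPTH 64

struct irq_queue {
    unsigned short dev_id[QDEPTH];  /* who asked, in arrival order        */
    unsigned int head, tail;        /* ring indices                       */
};

struct pic_state {
    struct irq_queue q[NPRIO];
    unsigned char pending;          /* bit per priority: queue non-empty  */
};

static void post_msi(struct pic_state *p, int prio, unsigned short dev)
{
    struct irq_queue *q = &p->q[prio];
    q->dev_id[q->tail++ % QDEPTH] = dev;  /* overflow handling elided     */
    p->pending |= 1u << prio;             /* request interrupt at 'prio'  */
}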

This is why I defined my interrupt architecture as model specific,
because there is no one size fits all. Similarly, the cpu's interface
for interrupt handling is also model specific.

> <
> So
> <
> DMA address gets translated though the OS mapping tables using
> the requesting thread virtual address translated to host virtual address
> which is translated by the virtual machine mapping tables to physical
> address.

check

> <
> {Now this is where it gets interesting}
> <
> After DMA is complete, device sends an interrupt to devices interrupt
> handler (which is IN the OS under the HV). Interrupt causes a control
> transfer into interrupt handler, interrupt handler reads a few control
> registers in memory mapped I/O space, builds a message for a lower
> level Task handler , and exits. Interrupt handler ran at "rather" high priority,
> task handler runs at medium priority. Task handler takes message
> rummages through OS tables and figures out who needs (and what kind
> of) notifications, and performs that work. Then Task handler exits (in one
> way or another) to the system scheduler. System scheduler figures out who
> to run next--which may have changed since the arrival of the interrupt.

Yes.
The notifications to other tasks (or oneself) can complete a wait operation.
Completing a wait may transition a thread from Wait to Ready state
if it was not already Ready or Running.
Transition of a thread to the Ready state requests the scheduler.
Eventually the scheduler selects the thread that requested the IO, which
performs any final resource deallocation inside the process address space.

I just realized there are two different schedulers.
One for threads that service HW like interrupts and other non-pagable code.
One for threads that can perform waits, like for IO or page faults.

> <
> All of the last several paragraphs are transpiring as other interrupts
> are raised and various control transfers take place. {Sometimes in
> rather bazaar orders due to timing; but it all works out.}

Yes

> <
> In this virtual device case:: the OS may not KNOW where its device actually
> is {with paravirtualization it would} and so may not be able to properly
> affinitize the handlers.
> <
> Anyway, I have figured out the vast majority of the device virtualization
> requirements, and have been working on the kinds of HW services
> are required to make this stuff fast {Mostly just caching and careful
> Table organization.}
>> The SMP interrupt service threads use per-device, per-interrupt spinlocks
>> to serialize their execution of each device's ISR.
> <
> I have figured out a way to perform this serialization without any
> kind of locking......

Ok

>> To answer your question, we could program the PIC to route the
>> interrupts to the physically closest core, and that core's
>> interrupt service thread would use that core's memory for its stack.
> <
> Yes, route interrupts from the virtual device to the virtual requestor
> interrupt handler and perform control transfer {based on priority of
> the interrupt and the priority of the core set it is allowed to run on}
>> If we know a particular device driver will always be serviced by a
>> single core then when the driver is loaded or the devices are mounted,
>> its code and data should also be allocated from that core's home memory.
> <
> So somebody in the chain of command granting guest access to virtual
> device does the affinity or pinning of the interrupt threads to core set.
> <
> Check.
> <
> I suspect for completeness, the 'task' handler is pinned to the same
> core set.

check

> <
> Now imagine that the interrupt handler can transfer control to the
> task handler without an excursion through the OS (or HV) process
> schedulers.

Any intermediate steps after IO completes would be to free resources
so other IO requesters can use them. And those resources are managed
inside the OS.
I mean inside the OS when physical pages are recycled.
First they go onto the free-dirty list.

The OS can have a pool of pre-zeroed physical pages
for when someone faults a zero-init page.

One of the things a background thread does is take free-dirty
physical pages, zero them, put them on the free-zero list.

> <
> But there always seems to be an idle process somewhere running
> at a priority lower than everyone else.
> <
> Thanks for reminding me of this.
>>> <
>>> Would a device page fault be directed at the interrupt handler or at I/O
>>> page fault handling driver ?
> <
>> OS kernel code and data falls into two categories, pagable and non-pagable.
>> Some OS only allow non-pagable kernel memory.
>> If a page fault occurs in a non-pagable memory section, the OS crashes.
> <
> Question: In an OS under HV, does the HV have to know that OS page
> is pinned ? Or since OS is under HV, HV can page this out, OS takes
> page fault into HV, HV services ad returns, so as far as OS "sees'
> the page is "resident". On the other hand, enough latency might have
> transpired for the device to have failed before interrupt handler could
> service interrupt. !?!?
> <
>> In any OS that I am familiar with, all interrupt code and data are
>> non-pagable in order to avoid all sorts of nasty race conditions.
> <
> Check, and good to know (be reminded of)
>> when a device driver is loaded, the memory sections specified by its exe
>> are allocated and assigned physical pages at create time,
>> the pages are pinned so they never get outswapped,
>> and the code or data is copied from the exe file at driver load.
>> Similarly, as each device of that driver is mounted,
>> any local memory for that device is allocated from non-paged heap.
> <
> I suspect that when one plugs in a USB device this happens at that instant.

Yes, in part at boot for the USB hub controller.
In part when a new device is plugged in and recognized.


MitchAlsup

Aug 26, 2021, 2:16:44 PM
I am assuming this is just std OS/HV stuff running privileged but
not at very high priority.
<
> - when device available, queue for a free DMA channel
> - when DMA channel available program actual device hardware
<
OS or HV on paravirtualized OS
<
> - raise interrupt priority to block device interrupts
> - spinlock to prevent concurrent access to device interrupt data
<
This spinlock is what I think I can get rid of.
<
> - write device control registers
>
> On IO completion interrupt undo all the things we did above,
> releasing resources and possibly allowing other queued IO to continue.
<
Is the above paragraph done at the Task handler level or the interrupt handler level?
>
> Maybe loop over all of this if the IO was larger than resource allocations
> allow to be allocated at once to a single requester.
>
> There is also asynchronous IO cancellation to deal with,
> both when IO is queued and, depending on the device, possibly while running.
> > <
> >> The Priority Interrupt Controller (PIC), or what x64 calls APIC,
> >> controls to which core interrupts are routed. If any interrupt can be
> >> routed to any core, and if interrupt priorities are 1 (low) to 7 (high),
> >> and if we have 64 cores, then there are 448 HW threads dedicated
> >> to servicing interrupts (which is why we don't do this).
> > <
> > My recent research on this shows the number of interrupt "levels"
> > increasing to 256-ish interrupts, each interrupt vector having its
> > own priority and a few related things. AMD virtualization via I/O-MMU.
<
> There are two issues: one is the number of interrupt priority levels,
<
There are 64 priority levels; both handlers and processes run on these,
although this is not fixed in stone.
<
> and one is the ability to determine directly which device is interrupting.
<
The physical device sends a virtual interrupt; the interrupt vector table determines
the priority and which processor and core the interrupt handler should run on.
>
> I have heard of 255 levels on embedded systems where designs have a fixed,
> well known set devices to deal with, they like to use one interrupt level
> per device interrupt. Because the OS is often doing hard real-time,
> having a unique priority for each device source fits nicely into
> the deadline scheduling model.
>
> For general purpose computers, a large number of priority interrupt
> levels offers nothing.
<
How few is enough? In any event I can fit 256 interrupt vectors,
64 exception vectors, 256 processes, and the 64 priority Queues
on a single page of memory. 64 was chosen merely because it fit.
<
> However being able to directly determine which device id requested
> an interrupt saves polling all the devices on a daisy chain.
<
Interrupt handler receives virtual Bus:Device,Function Node,
Interrupt number, and the process which initiated the I/O.
<
> One could have a small number of priorities, say 7 or 9,
> but also something like PCI's Message Queued Interrupts to put
> the requesting device's ID number into the queue for that priority
> connected to a huge number of devices.
<
Yes, a large part of this priority queue is to provide a place where message-
signaled interrupts can land and wait for their turn at execution
resources.
<
> If a request queue is not empty then a hardware interrupt is
> requested at the queue's priority.
<
I was thinking more along the lines that this Queue reads from
the highest priority non-blocked queue, builds Thread State, and
sends a context switch to a particular processor:core. The
message contains all the bits one needs to run the process; upon
arrival, whatever the processor:core was doing gets messaged
back to the queue.
<
Thus, for interrupts, one has thread state of the interrupt
handler along with the notification of (ahem) the interrupt.
That is, as seen at a processor:core, the notification (interrupt)
arrives only 1 cycle before the thread state, including registers.
In my model: The completing I/O can send (MSI) an interrupt to the
waiting thread, and the hardware places the waiting process on
appropriate priority queue where it awaits execution cycles. No
need to invoke the scheduler (directly). Eventually the unWaited process
reaches the front of the highest non-Empty non-Blocked Queue,
and gets launched into execution with a Thread State message.
>
> I just realized there are two different schedulers.
> One for threads that service HW like interrupts and other non-pagable code.
> One for threads that can can perform waits like for IO or page faults.
<
I saw this as 2-levels, your second level always being dependent on
the first. The first level runs non-pagable, the second runs pagable.
<
> > <
> > All of the last several paragraphs are transpiring as other interrupts
> > are raised and various control transfers take place. {Sometimes in
> > rather bazaar orders due to timing; but it all works out.}
> Yes
> > <
> > Unless you have a means to make it simply appear as if the page
> > were filled with zeroes. In which case you allocate the page upon
> > call to malloc, and as you touch every line it automagically gets filled
> > with zeros (without reading memory to fill the lines--like Mill).
<
> I mean inside the OS when physical pages are recycled.
> First they go onto the free-dirty list.
>
> The OS can have a pool of pre-zeroed physical pages
> for when someone faults a zero-init page.
>
> One of the things a background thread does is take free-dirty
> physical pages, zero them, put them on the free-zero list.
<
We could make this another Idle process (or really low priority),
at least until the free page pool is rather dry, when its priority is
changed and it runs more often.

MitchAlsup

Aug 29, 2021, 6:24:16 PM
On Thursday, August 26, 2021 at 1:16:44 PM UTC-5, MitchAlsup wrote:
> On Wednesday, August 25, 2021 at 9:53:19 PM UTC-5, EricP wrote:
<
Given a definition where Node means chip; and chip can contain
multiple processors, each processor having several cores (SMT or
other), along with memory controllers, Node repeater links, and
host bridges::
<
It seems to me that in an age of unbounded processors and cores,
using the affinity relationship to find a processor suitable for running
an affinitized process is harder than simply moving an already affinitized
process between RUN<->WAIT.
<
Thus, one can suitably "throw" HW at the run<->wait latency*, but
one should never bother to throw HW at the affinity problem--that
should always remain a problem the OS(HV) has to solve.
<
But this could lead to it taking longer to affinitize a process to
a processor (and a suitable core thereon) than it takes to context switch
into an already affinitized process.
<
(*) both as seen in total overhead cycles and as seen by either OS(HV)
controlling a process, or by process being controlled.

JimBrakefield

Aug 29, 2021, 7:45:17 PM
Affinity allocation of cores would seem to require three aspects:

A) Definition of an affinity metric.
B) Parameters in the core(s) request (Number of cores, duration, memory
scratch space, existing memory allocation, cost for moving memory and
cache contents, etc).
C) Hardware to evaluate the affinity metric across available resources.

MitchAlsup

Aug 29, 2021, 9:58:42 PM
On Sunday, August 29, 2021 at 6:45:17 PM UTC-5, JimBrakefield wrote:
> On Sunday, August 29, 2021 at 5:24:16 PM UTC-5, MitchAlsup wrote:
> > On Thursday, August 26, 2021 at 1:16:44 PM UTC-5, MitchAlsup wrote:
> > > On Wednesday, August 25, 2021 at 9:53:19 PM UTC-5, EricP wrote:
> > <
> > Given a definition where Node means chip; and chip can contain
> > multiple processors, each processor having several cores (SMT or
> > other), along with memory controllers, Node repeater links, and
> > host bridges::
> > <
> > It seems to me that in an age of unbounded processors and cores,
> > using the affinity relationship to find a processor suitable for running
> > an affinitized process is harder than simply moving an already affinitized
> > process between RUN<->WAIT.
> > <
> > Thus, one can suitably "throw" HW at the run<->wait latency*, but
> > one should never bother to throw HW at the affinity problem--that
> > should always remain a problem the OS(HV) has to solve.
> > <
> > But this could lead to it being taking longer to affinitize a process to
> > a processor (and suitable core thereon) that it is to context switch
> > into an already affinitized process.
> > <
> > (*) both as seen in total overhead cycles and as seen by either OS(HV)
> > controlling a process, or by process being controlled.
> Affinity allocation of cores would seem to require three aspects:
>
> A) Definition of an affinity metric.
<
Yes, we agreed earlier in this thread that this metric would take on the
kind of addressing used in networks--more hierarchical and less flat
than the current bit vector approach.
<
> B) Parameters in the core(s) request (Number of cores, duration, memory
> scratch space, existing memory allocation, cost for moving memory and
> cache contents, etc).
<
I think we agreed earlier that there would be a number of Nodes in a system
each node would have several processors and at least one Host Bridge
and at least one Memory Controller. Each processor could have several
cores, each core could run several threads.
<
In addition, we agreed that the reason to use affinity was to preserve
the momentum of the cache hierarchy (including TLBs and Table Walk
caches), either by running "things that conflict" in different processors,
or by running "things that share" in the same processors.
<
Greatest performance occurs when a process continues to run on the
original core.
<
Next greatest comes when there is another core in the same cache hierarchy
that can run process.
<
But sooner or later, someone has to decide that process[k] wants to run
and since core[c] is idle, why not just run k on c.
<
> C) Hardware to evaluate the affinity metric across available resources.
<
Since we don't know what affinity data looks or even smells like (other than
smelling like network addressing) it is a bit early to think about HW
looking across the affinity data and deciding what goes on who.
<
But I think we can agree that the process of making this decision is considerably
harder than the job of picking up a process[k] already loosely bound to core[c]
and running it on core[c] again, and again, and again.
<
In fact, I am working on HW that performs context switching amongst
processes that have already been affinitized, leaving the OS to only need
to bind the process to the processor.......................

JimBrakefield

Aug 29, 2021, 10:35:36 PM
Thanks, now I understand some of the full context of the subject.
And next time I promise to web-search any new terms, no matter how straightforward they sound.

Ivan Godard

Aug 29, 2021, 11:24:46 PM
Seems like a hierarchy like that is a baby step toward
processor-in-memory - why not go all the way?

EricP

Aug 30, 2021, 1:24:21 PM
Yes. Most of these are cpu-level resources so are managed
at the same reentrancy level as kernel heap or scheduling.

> <
>> - when device available, queue for a free DMA channel
>> - when DMA channel available program actual device hardware
> <
> OS or HV on paravirtualized OS

Both as the guest OS is going to manage them as though
it is a real SMP machine.

And at some point HV has to turn them into real resources
which means it does the same things.

> <
>> - raise interrupt priority to block device interrupts
>> - spinlock to prevent concurrent access to device interrupt data
> <
> This spinlock is what I think I can get rid of.

Replace the spinlocks with what?
The concurrent access to the shared memory & resources must be coordinated.

> <
>> - write device control registers
>>
>> On IO completion interrupt undo all the things we did above,
>> releasing resources and possibly allowing other queued IO to continue.
> <
> Is the above paragraph done at the Task handler level or the interrupt handler level?

The OS scheduler and heap management level, whatever that is called.

It's all down to reentrancy - most of this code and these data structures
are not thread or interrupt reentrant.

This is the same problem as was discussed in the earlier thread about
asynchronous signal delivery and, say, memory heap management.
In that case heap access is disallowed from inside a signal routine
or any routine called by that signal routine. So we convert an
asynchronous signal to a synchronous message queued to the main
routine and poll the queue in the main loop.

The OS has the same reentrancy problems, but for more things.
So we create this fiction of a software interrupt priority level
that sits just above where threads run and below where ISRs run,
where we loop servicing our queue of messages from interrupt routines
and accessing all the non-reentrant data structures, like the kernel heap
or scheduler, and doing so without blocking interrupts.
In WinNT the name of that fictional SWI level is Dispatch level.

Most of these resources are managed at "Dispatch" level.
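
A sketch of that pattern (illustrative names, not the WinNT API): ISRs
enqueue small work items, and a per-core loop just below interrupt level
drains them:

/* ISRs only enqueue; this loop runs at the fictional SWI level, where it
   is safe to touch the non-reentrant kernel structures. */
struct work_item {
    struct work_item *next;
    void (*fn)(void *arg);      /* e.g. complete an IO, free its DMA channel */
    void *arg;
};

struct dispatch_queue {
    struct work_item *head;     /* ISR pushes here, very briefly blocking interrupts */
};

void dispatch_drain(struct dispatch_queue *q)
{
    struct work_item *w;
    while ((w = q->head) != NULL) {
        q->head = w->next;      /* real code needs an atomic/IRQ-safe pop */
        w->fn(w->arg);          /* free to use kernel heap, scheduler, etc. */
    }
}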

Things like allocating and freeing a bounce buffer for DMA.
Or a DMA channel from a common pool. These might have multiple
IO requests waiting in a queue to use them.
As each IO completes, it frees its resources, which may dequeue the
next IO request, and cause it to continue its request while this
IO continues to complete its.

By restricting allocate & free of these resources from being performed
at interrupt level, they do not have to block interrupts.
For external devices I had three, low, medium and high,
with three sub levels in each numbered from low to high
as a concession that not all devices are equal.

Priority assignment is based on the nature of interrupt.

Low is for patient devices.
Mostly signaling completion of a prior request and have no time deadline.
Disk read or write completion is an example.
Network card buffer transmit is also (but usually not buffer receive).

Medium is for impatient but recoverable devices.
This would be for signaling receipt of some message where
a buffer overrun can occur, but can be recovered at some cost.
Network card buffer receive is an example.
Also tape drive buffer read or write.

High is for impatient and non-recoverable devices.
This would be for signaling receipt of some message where
a buffer overrun can occur and cannot be recovered.
Reading an analog to digital converter could be an example.

Yes, except I'm now thinking of 3 levels:
- normal user and kernel threads which can do things like wait for IO
- OS level HW threads which can do things like allocate and free
resources, and schedule and wake up normal threads.
- interrupt level thread to service device interrupts.

It's all about controlling reentrancy and where and when thread Wait
states for things like page fault IO can occur, so that they cannot
indefinitely block other OS activity.


MitchAlsup

Aug 30, 2021, 1:52:42 PM
The place where it is easiest to see the scheduling data and the places
where the scheduled processes run do not need any correspondence, and
those running processes have access to an entire 64-bit virtual address
space, not just local memory.

Stefan Monnier

Aug 30, 2021, 2:02:44 PM
> Seems like a hierarchy like that is a baby step toward processor-in-memory -
> why not go all the way?

I suspect that the design space between the two extremes is probably
more interesting than either extreme.


Stefan

MitchAlsup

Aug 31, 2021, 12:31:39 PM
I can agree that you put your finger on the problem. A set of state (process)
can be running at most once.
>
> This is the same problem as was discussed in the earlier thread about
> asynchronous signal delivery and, say, memory heap management.
> In that case heap access is disallowed from inside a signal routine
> or any routine called by that signal routine. So we convert an
> asynchronous signal to a synchronous message queued to the main
> routine and poll the queue in the main loop.
>
> OS has the same reentrancy problems but for more things.
> So we create this fiction of a software interrupt priority level,
> that sits just above where threads run and below where ISR's run,
> where we loop servicing our queue of message from interrupts routines
> and accessing all the non-reentrant data structures, like kernel heap
> or scheduler, and doing so without blocking interrupts.
> In WinNT the name of that fictional SWI level is Dispatch level.
>
> Most of these resources are managed at "Dispatch" level.
>
> Things like allocating and freeing a bounce buffer for DMA.
> Or a DMA channel from a common pool. These might have multiple
> IO requests waiting in a queue to use them.
> As each IO completes, it frees its resources, which may dequeue the
> next IO request, and cause it to continue its request while this
> IO continues to complete its.
>
> By restricting allocate & free of these resources from being performed
> at interrupt level, they do not have to block interrupts.
<
After thinking about this for 2 days, I think if I said any more I would
lose patentability. Sorry--really I am.
A mighty thanks to EricP for annotating in great detail the interplay
between threads, interrupts, and OS functionality.

Quadibloc

unread,
Sep 1, 2021, 1:26:08 AM9/1/21
to
On Monday, August 23, 2021 at 12:24:05 PM UTC-6, MitchAlsup wrote:

> The current model of 1-bit to denote "any core" and then a bit vector
> to say "any of these cores can run this task/thread" falls apart when
> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
> The bit vector approach "does not scale".

One simplistic way to represent processor affinity for a chip, or
multi-chip module, containing a very large number of cores,
where it's awkward to use a bit vector of more than 64 bits, would
be this:

First, one designates the "home core" of a program. This is the ID of
the core that the program will start from.

That core would belong to a group of 64 cores, and so a bit vector
would then be used to indicate which of the (other) processors in
that group could also be used for that task or thread.

These groups of 64 cores would then be organized into bundles of
64 such groups. If the task or thread could be executed by cores
_outside_ the initial group of 64 cores, the next bit vector would indicate
which other groups of 64 cores could also be used for the task. (One
assumes that in that case, if the task is that large in scale, any individual
core in an indicated group of 64 cores could be used.)

And then, of course, how it scales up is obvious. The next 64-bit vector says
which groups of 4,096 cores outside the initial group of 4,096 cores can be
used for the program, and so on and so forth.
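
A rough C sketch of the encoding John describes (field names and the
three-level depth are my choices, purely illustrative):

#include <stdint.h>
#include <stdbool.h>

#define LEVELS 3                  /* cores, groups of 64, groups of 4,096 */

struct affinity {
    uint32_t home_core;           /* core the program starts from          */
    uint64_t mask[LEVELS];        /* [0]: cores within the home group      */
                                  /* [1]: other groups of 64 in the bundle */
                                  /* [2]: other groups of 4,096, and so on */
};

/* May a thread with affinity 'a' run on 'core'? */
static bool affinity_allows(const struct affinity *a, uint32_t core)
{
    if (core / 64 == a->home_core / 64)                /* same group of 64   */
        return (a->mask[0] >> (core % 64)) & 1;
    if (core / 4096 == a->home_core / 4096)            /* same bundle        */
        return (a->mask[1] >> ((core / 64) % 64)) & 1; /* any core in group  */
    return (a->mask[2] >> ((core / 4096) % 64)) & 1;   /* any core in bundle */
}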

John Savard

Stefan Monnier

unread,
Sep 1, 2021, 8:26:11 AM9/1/21
to
> And then, of course, how it scales up is obvious. The next 64-bit vector says
> which groups of 4,096 cores outside the initial group of 4,096 cores can be
> used for the program, and so on and so forth.

So you're proposing a kind of floating-point system, where the
granularity/mantissa specifies "64 things" and the "exponent" specifies
whether those things are cores, groups of cores (e.g. chips), or groups of
chips (e.g. boards), ...


Stefan

EricP

unread,
Sep 1, 2021, 10:16:49 AM9/1/21
to
Combining what Stephen Fuld and I wrote, I'd summarize it
as there being two kinds of thread affinity:
hard, which is fixed by the configuration of a particular system,
and soft, which is dynamic.

The basic hard thread affinity is a capability bit vector:
which cores are capable of running a thread.
Maybe only some cores have a piece of hardware like an FPU.

The next hard thread affinity is a ranked preference vector,
where there can be multiple cores of equal rank.
In a big-little system low priority threads might give equal
preference to the multiple little cores.

Another example of hard preference is based on network topology.
A particular core might have an attached device where it is
cheapest to talk to the device from that core.

0--1        core  0  1  2  3
|  |        rank  0  1  2  1
3--2
 \
  IO
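
A minimal C sketch of consuming such hard-affinity data (the capability mask
and the rank array are my assumed representation, not EricP's):

#include <stdint.h>

#define NCORES 4

/* Per-core ranks as in the topology example above; lower = more preferred. */
static const uint8_t rank[NCORES] = {0, 1, 2, 1};

/* Pick the most-preferred core that is capable of running the thread. */
static int pick_core(uint64_t capable_mask)
{
    int best = -1;
    for (int c = 0; c < NCORES; c++) {
        if (!((capable_mask >> c) & 1)) continue;        /* hard: not capable */
        if (best < 0 || rank[c] < rank[best]) best = c;  /* prefer lower rank */
    }
    return best;                                         /* -1: none capable  */
}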

Soft affinity is based on the recent history of a thread: any cache
it has invested in loading, or the home directory of its physical pages.
As a thread runs it invests more in its current local core
and lowers its investment in prior cores.

Soft affinity interacts with the network topology graph you describe,
to give a dynamic cost to locating threads at different cores.

All the affinity info folds into how the OS thread scheduler works.

One consideration is how much effort one wants to put into the
scheduler algorithm. The above affinities aren't just a traveling
salesman problem; it's S traveling salesmen visiting C cities,
some with lovers, sometimes multiple ones, in different cities
that keep changing.

Another scheduler consideration is SMP data structures and locking.
If all the scheduler data is in one pile guarded by a single spinlock,
all cores can contend for it and you can wind up with a long queue
spin-waiting to reschedule.

That is why many SMP schedulers now separate their scheduling into
core-local lists which do not contend and require no spinlocks,
and global lists which are accessed infrequently and spinlock guarded.
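
A rough sketch of that split, assuming POSIX spinlocks and invented names
(work stealing and the wake-up path are omitted):

#include <pthread.h>
#include <stddef.h>

struct thread_ctl { struct thread_ctl *next; /* ...saved thread state... */ };

struct core_sched {
    struct thread_ctl *local_head;  /* touched only by its own core: no lock */
};

static struct thread_ctl    *global_head;   /* shared, accessed infrequently */
static pthread_spinlock_t    global_lock;   /* pthread_spin_init() at boot   */

static struct thread_ctl *pick_next(struct core_sched *cs)
{
    struct thread_ctl *t = cs->local_head;
    if (t) {                                  /* fast path: core-local list  */
        cs->local_head = t->next;
        return t;
    }
    pthread_spin_lock(&global_lock);          /* slow path: global list      */
    t = global_head;
    if (t) global_head = t->next;
    pthread_spin_unlock(&global_lock);
    return t;                                 /* NULL means go idle          */
}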

Also in a 4000 node NUMA SMP system, a remote cache hit can be 400 nsec
away and DRAM can be 500 nsec. The more things the scheduler touches,
the more chance it has of touching one of those 400 or 500 nsec things,
and then it winds up sitting around playing the kazoo for a while.

It may not be worthwhile over-thinking thread affinity because
in the end it may be too complicated for the scheduler to make
use of in a reasonable window of time.

It might be worthwhile working backwards from the OS scheduler
to see what affinity information it can make use of,
then looking at how that information can be acquired.


MitchAlsup

unread,
Sep 1, 2021, 1:40:46 PM9/1/21
to
This is an excellent observation, missing only 1 point:
<
it's S traveling salesmen visiting C cities where the next route is being
decided by K schedulers randomly placed around the country.
<
Where K is the number of OS/HV threads "taking a look" at the current
schedule.
<
>
> Another scheduler considerations is SMP data structures and locking.
> If all the scheduler data is in one pile guarded by a single spinlock,
> all cores can contend for it and you can wind up with a long queue
> spin-waiting to reschedule.
>
> That is why many SMP schedulers now separate their scheduling into
> core-local lists which do not contend and require no spinlocks,
> and global lists which are accessed infrequently and spinlock guarded.
<
Yes, this is where I think it is heading. There is some master schedule
on which essentially all runnable processes reside. The first time a
process runs it may be placed rather randomly on a node (=chip) and
then as long as that node is not saturated with work, the process is
adequately served with compute cycles, so process[p] sticks around
on node[n] unless there arises a compelling case where the OS
wants to run [p] on some other node[!=n]. The OS then removes the
original affinity and applies a new affinity to the process.
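
A small sketch of that stickiness policy (the saturation threshold of two
runnable threads per core is an arbitrary assumption of mine):

#define NNODES 16

struct node_stat { int runnable; int cores; };

static int place_process(int current_node, const struct node_stat ns[NNODES])
{
    /* Keep the existing affinity while the node can still serve the load. */
    if (current_node >= 0 &&
        ns[current_node].runnable < 2 * ns[current_node].cores)
        return current_node;

    /* Otherwise pick the least-loaded node and re-affinitize. */
    int best = 0;
    for (int n = 1; n < NNODES; n++)
        if (ns[n].runnable * ns[best].cores < ns[best].runnable * ns[n].cores)
            best = n;
    return best;
}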
<
There is an OS thread on node[n] that manages the schedule on
node[n] which is subservient to the master OS process scheduler.
>
> Also in a 4000 node NUMA SMP system, a remote cache hit can be 400 nsec
> away and DRAM can be 500 nsec. The more things the scheduler touches,
> the more chance it has of touching one of those 400 or 500 nsec things,
> and then it winds up sitting around playing the kazoo for a while.
<
It would surprise me if a remote cache were not a full microsecond or more away.
Similar for DRAM. Most of this is network (link) latency.

Quadibloc

unread,
Sep 1, 2021, 11:44:00 PM9/1/21
to
Not quite, because I realized that one other thing is needed: the subthreads
down in one of the other groups of cores should have their _own_ bit map at
the single-core level. The boundaries between groups may be artificial,
and so one will want processor affinity with single-core granularity crossing group
boundaries.

Still, what you've said is sort of partly true...

John Savard

MitchAlsup

unread,
Sep 19, 2021, 4:19:23 PM9/19/21
to
Let us consider a system made out of 16 "processor" chips, each containing
16 cores, a PCIe bridge, a Memory Controller, and "processor to processor" links.
<
One process has asked the OS to perform some I/O and wake it back up when done.
This process[p] was originally running on node[n] in core[c] and the OS decides that
the accessed device is over on node[IO].
<
Given that Interrupt handlers are affinitized to the "chip" containing the PCIe
bridge to the actual device: when the I/O is complete, device sends interrupt
to node[IO] core[IH].
<
Interrupt handler runs, captures I/O completion status, and signals the {lower
priority} device driver of the completion, then exits. Device driver runs, sees the
completion of the request from process[p] and schedules process[p] to run on node[n].
<
Process[p] is affinitized to ANY core in node[n], so the device driver sends an IPI to some
core[IPI] on node[n], and this core[IPI] determines that process[p] can be run on core[j!=c]
and places process[p] on the run queue of core[j]. If process[p] is of higher priority than
what is currently running on core[j], and IPI != j, core[IPI] then sends an IPI to core[j]
to context switch into this process[p].
<
So, it seems to me that affinity has added at least 1 to the exponent of process scheduling
complexity.
<
<
Now consider that the time delay from sending an IPI to receiving the IPI on a different
node may be 200-400 cycles ! So, anytime a decision is made to "talk" to another core
on another node, by the time the IPI arrives, the state of the system can change markedly.
{The typical L1 cache delay is 3-4 cycles (as seen at the core), L2 12-24 cycles, L3 may
be on the order of 50 cycles, and then there is link delay from sending node to receiving
node, so there are lots of cycles from the sending of an IPI to the receipt of the IPI no matter whether
these are routed as memory accesses or messages. In addition there is context switch
time to get to the IPI handler.}
<
So, IPI arrives, OS.process[IPI] looks at the run queue, and decides to IPI over to affinitized
core to run process[p]. Does process[p] get taken off the run queue at this point, or does
core[j] have to discover this all by itself ?
<
This is beginning to smell like a difficult journey, and much like the CPU{core} is seldom
in the proper place in the system to perform ATOMIC stuff in the fewest number of cycles
possible; it seems none of the threads in a multi-threaded OS is in a position to "take a look
at the run queues" without grabbing a big "lock".

Timothy McCaffrey

unread,
Sep 21, 2021, 2:13:48 PM9/21/21
to
For some time I have thought that the model of:
(Sending): Get lock -> Put stuff in queue -> release lock -> send IPI
(Receiving): Rcv IPI -> get lock -> get stuff out of queue -> release lock -> process stuff from queue

Was pretty inefficient.

I would like a hardware queue (perhaps similar to the old I2O interface that Intel pushed (but better!))
between processors. Then the processing becomes:
(sending):Create message -> push message pointer to other CPU via hardware register
(Receiving): (probably get interrupt when Hardware message queue goes from empty to non-empty)
Interrupt->
Loop:
get pointer from hardware queue->process stuff in message.
if hardware queue not empty go to loop

Ta Da! No locks!

(well, they are conceptually still there, they are (sort of) part of the hardware queue).
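
A software model of such a hardware mailbox, written as a single-producer/
single-consumer ring with C11 atomics (illustrative only; real hardware would
raise the "queue went non-empty" interrupt instead of being polled):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QSIZE 64                        /* power of two */

struct msg_queue {
    _Atomic uint32_t head;              /* advanced by the receiver */
    _Atomic uint32_t tail;              /* advanced by the sender   */
    void *slot[QSIZE];                  /* message pointers         */
};

/* Sender side: "push message pointer to other CPU". */
static bool mq_send(struct msg_queue *q, void *msg)
{
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QSIZE) return false;                  /* queue full          */
    q->slot[t % QSIZE] = msg;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;                                       /* hw raises IRQ here  */
}

/* Receiver side: called in a loop until it returns NULL, as in the
   interrupt loop sketched above. */
static void *mq_receive(struct msg_queue *q)
{
    uint32_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t) return NULL;                           /* empty               */
    void *msg = q->slot[h % QSIZE];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return msg;
}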

- Tim

MitchAlsup

unread,
Sep 21, 2021, 5:06:10 PM9/21/21
to
Modern times have proven that assumption dubious at best.
>
> I would like a hardware queue (perhaps similar to the old I2O interface that Intel pushed (but better!))
> between processors.
<
This is the direction I am trying to head without using those words.
<
There is some HW facility where you send it messages (and in this context an IPI is simply
a message) and in response the facility sends to an affinitized processor a context switch
message all based on the priorities in some kind of queuing system.
<
Since there are multiple chips, and each chip contains multiple cores, in any given clock
there could be multiple interrupts, messages, and what-not flying around, so a core is not an
appropriate place to "perform this work".
<
So I am investigating a "facility" located in the memory controller on 'chip'. This facility is
configured with a couple of pointers in config space which give facility exclusive access
to 'some memory'. In this memory there are interrupt vectors, exception vectors, and ADA
entry vectors. Associated with these 'vectors' is a set of priority queues (about 64 levels
seems sufficient). These vectors are the targets of messages {interrupts, exceptions,
calls, taps}, and such messages use the priority in the vector to find the appropriate queue.
Facility is a lot like a function unit, except that it is located away from any core and close
to the memory it has been assigned. Its instructions are the first doubleword of the
messages, its operands the rest of the message, and its results are messages to cores;
it performs a bit of lookup and queuing as its raison d'être.
<
In My 66000 the entire facility is accessed via a single address (per chip) in non-cacheable
memory. Facility heavily caches this space, but cores all see this space as non-cacheable.
The ISA has been augmented to send messages to facility {size 1-64 doublewords}, and
core HW has been augmented to receive a context switch message by absorbing all of the
message while producing an "I got context switched out" message.
<
By using a single address (misaligned to boot) for all messages (within a chip) to facility,
messages are forced to arrive in transit order, and arrive completely, and without interference.
Inter-chip messages use a different address which is singular to the addressed chip.
<
Thus, context switch is a LOT like the CDC 6600, where the peripheral processors sent an XCHG
instruction to the cores to context switch from one thread to another in 16 cycles. Facility
is this peripheral processor here.
<
>Then the processing becomes:
> (sending):Create message -> push message pointer to other CPU via hardware register
> (Receiving): (probably get interrupt when Hardware message queue goes from empty to non-empty)
> Interrupt->
> Loop:
> get pointer from hardware queue->process stuff in message.
> if hardware queue not empty go to loop
>
> Ta Da! No locks!
<
Getting rid of the locks is of paramount importance.
<
One MUST include ADA call-accept queuing in the SAME overall structure.
So a call to an ADA entry point is a message--perhaps the message is as simple as the
pointer to thread-state {which contains the thread header and register file; the thread
header contains the root pointer to translations}.
An ADA <conditional> accept is a similarly small message.
Should the call arrive before the accept, the call is placed on the tail of the queue of
that entry point.
Should the <unconditional> accept arrive before any call, it is placed on the tail=head
of the <empty> queue.
When a call is joined with an accept a context switch to the rendezvous entry point
is performed.
The rendezvous has access to the caller via the thread state where it can access
the register file and memory using something that smells a lot like the PDP-11/70
LD-from-previous, ST-to-previous memory reference instructions. These are played
out in the address space of the caller. The rendezvous has access to its own memory
via normal LD and ST instructions.
When rendezvous is done, a message is sent to facility, which then re-enables both
caller and ADA task simultaneously.
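
A minimal sketch of just the call/accept join logic described above (the data
layout and names are mine, and the per-entry caller queue is collapsed to a
single slot to keep it short):

#include <stdio.h>

struct entry_point {
    void *queued_call;     /* caller thread-state that arrived before an accept */
    void *queued_accept;   /* accept that arrived before any call               */
};

/* A real implementation would send a context-switch message to the facility;
   here we just report the join. */
static void rendezvous(void *caller, void *acceptor)
{
    printf("join caller=%p acceptor=%p\n", caller, acceptor);
}

static void entry_call(struct entry_point *e, void *caller)
{
    if (e->queued_accept) {              /* accept arrived first: join now     */
        void *a = e->queued_accept;
        e->queued_accept = NULL;
        rendezvous(caller, a);
    } else {
        e->queued_call = caller;         /* else queue the call on the entry   */
    }
}

static void entry_accept(struct entry_point *e, void *acceptor)
{
    if (e->queued_call) {                /* a call is already queued: join now */
        void *c = e->queued_call;
        e->queued_call = NULL;
        rendezvous(c, acceptor);
    } else {
        e->queued_accept = acceptor;     /* else park the accept on the queue  */
    }
}

int main(void)
{
    struct entry_point e = {0};
    int caller_ts, acceptor_ts;          /* stand-ins for thread-state blocks  */
    entry_call(&e, &caller_ts);          /* call arrives first and is queued   */
    entry_accept(&e, &acceptor_ts);      /* accept arrives: join happens       */
    return 0;
}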
<
Rendezvous have the NATURAL property that they can only be ACTIVE once--hence
no re-entrancy problems or even the need to perform locks.
<
I plan on making Interrupt service routines have this property, and use facility to
perform the interrupt queuing. ISRs are serially reusable.
<
Back at facility: messages from cores to facility are short; the message from facility
to a core is "everything needed to finish performing a context switch", 8 thread state
doublewords and 32 register doublewords. You do not tell a core to go and context
switch into something, you send it everything it needs to already be in that context.
In return, if the core is already running something, it sends back the current state of
the running thread, which facility puts on the front of its priority run queue.
<
Thus multiple scheduling and context switches can be simultaneously in progress
without any locking going on--all because it is not being performed in a core !
<
Additionally, since the context switch is a single message, there is no chance something
can get in the way while it is transiting the links, thus removing any need for core to
perform locks. In My 66000 such a context switch message is 9 cache lines long.
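
Purely as an illustration of "the whole context travels as one message", here
is a struct mirroring the sizes mentioned above (8 thread-state doublewords
plus 32 register doublewords); the actual My 66000 layout is not given here,
so the field contents are assumptions:

#include <stdint.h>

struct context_switch_msg {
    uint64_t thread_state[8];   /* header, root pointer, priority, ... (assumed) */
    uint64_t regs[32];          /* the full register file                        */
};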
<
Also note: Facility remembers at what priority level all cores (in a chip) are operating,
and does not send messages unless they are of higher priority than what is currently
running.
<
Other than the latency from facility to core, each core is always running the highest
priority thread that has been affinitized to it; and there is no (0, zero, nada, zilch)
OS overhead associated with normal thread context switch behavior. The OS only needs
to be involved in run<->wait management and affinity.
<
All the operating system is left to do is to manage the run<->wait status of threads
and perform affinitization based on whatever mode of affinity model the OS is using.
<
<-------------------------------------------------------------------------------------------------------------------------------
<
So, say we have a core running user thread[t] and along comes an interrupt. Facility
sees the message, finds its priority, sees that core[c] running thread[t] is of lower priority
than ISR[i], so facility sends the ISR thread state to core[c]; core[c] sends back the current
context of thread[t] and begins running ISR[i] while facility places thread[t] on the head
of its run queue.
<
In the performance of its duties, ISR[i] sends a message to run I/O cleanup task[k]
to facility, and then <shortly> exits.
<
Sooner or later, facility sends the context switch into I/O cleanup task[k] to some core[?],
where it takes thread[m!=t] off the wait queue and places it on the run queue by sending
a run message to facility, then <after a while> exits.
<
After all the flurry of context switching has transpired, the original thread[t] returns to
running on core[c] and the formerly waiting thread[m] is running on some other core[!=c].

Ivan Godard

unread,
Sep 21, 2021, 9:17:46 PM9/21/21
to
How do you track which quanta budget to charge for the switched-in runtime?

MitchAlsup

unread,
Sep 21, 2021, 9:39:05 PM9/21/21
to
Facility knows 'when' the context switch message was sent out and 'when' that thread's
<then> current state gets sent back.
<
Thus, with access to Real Time Clock, the subtraction is easy.
<
It is more precise than when the transition through microcode takes 500 cycles to "take"
the first interrupt of the flurry.