
Intel QPI and the HyperTransport bus architecture, and some NUMA questions.

Luciann Bennet

Dec 22, 2009, 5:37:47 PM

Hi all;

How much would the new Intel QPI and HyperTransport interconnect
designs impact the way a kernel has to initialize? Both of these
interconnect topologies are NUMA-enabled, and they are being
implemented on newer x86-64 processors from Intel and AMD respectively.

This is likely to change significantly the way desktop OSs relate to
the hardware on a chipset, or rather, based on the specifications
published in the white papers for both bus topologies, the
chipset_S_. The emphasis there was on the plural of the word
chipset ;)

For example, in a system where all processors are guaranteed to share
the same bank of memory, rather than each physical CPU having its own
integrated memory controller and a set of RAM chips owned by it
alone, you can safely assume that all processors share the same RAM.

This means that enumerating RAM is as simple as assuming that there
*IS* RAM connected to any particular CPU, or in fact to all CPUs,
since they all share the same memory controller. In other words,
there is only ONE bank of CPUs, all sharing ONE memory controller.

But on a NUMA system, enumerating RAM could be more troublesome. I'm
sure the convention of starting with 1 processor and 1 core at boot
will remain, so somehow you'll have to enumerate the RAM on the local
processor's memory controller, then wake up all the cores on that
memory bank.
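
To make the per-node enumeration concrete: firmware on NUMA machines
typically reports which physical address ranges belong to which node
(on x86, through the ACPI SRAT). The sketch below is a minimal
illustration, assuming a hypothetical mem_range table rather than the
real ACPI structure layout, and an invented two-node topology that
places node 1's memory at 0x80000000.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory-affinity entry, loosely modeled on what firmware
 * tables report; NOT the real ACPI SRAT layout. */
struct mem_range {
    uint64_t base;    /* first physical address of the range */
    uint64_t length;  /* size of the range in bytes */
    int      node;    /* NUMA node whose memory controller owns it */
};

/* Invented topology: two nodes, 2 GiB each, back to back in one
 * physical address space. */
static const struct mem_range ranges[] = {
    { 0x00000000ULL, 0x80000000ULL, 0 },
    { 0x80000000ULL, 0x80000000ULL, 1 },
};
static const size_t nranges = sizeof ranges / sizeof ranges[0];

/* Return the node owning physical address addr, or -1 if no range covers it.
 * Comparing (addr - base) < length avoids overflow at the top of a range. */
int node_of_phys(uint64_t addr)
{
    for (size_t i = 0; i < nranges; i++) {
        if (addr >= ranges[i].base && addr - ranges[i].base < ranges[i].length)
            return ranges[i].node;
    }
    return -1;
}

/* Total bytes of RAM attached to the given node (0 for a memoryless node). */
uint64_t node_mem_bytes(int node)
{
    uint64_t total = 0;
    for (size_t i = 0; i < nranges; i++)
        if (ranges[i].node == node)
            total += ranges[i].length;
    return total;
}
```

With this table, node_of_phys(0x100000) gives node 0 and
node_of_phys(0x80000000) gives node 1, even though both addresses live
in the same flat physical address space.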

Next, I assume you'll have to send some sort of inter-NUMA-bank
interrupt to wake the BSP on each of the other memory banks, and then
have those processors execute the initialization code residing in the
original BSP's memory bank.
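
For reference, on x86 there is normally a single BSP per system, and
it wakes every other processor (including those on other nodes) by
sending each one an INIT IPI followed by two STARTUP IPIs through its
local APIC, addressed by APIC ID. The sketch below shows how those
Interrupt Command Register values are built, assuming xAPIC mode;
write_icr and delay_us are hypothetical platform hooks, and the delays
are the conservative values from the Intel MP initialization protocol.

```c
#include <stdint.h>

/* Local APIC ICR field values (xAPIC mode). */
#define ICR_DELIVERY_INIT    (5u << 8)   /* delivery mode 101b */
#define ICR_DELIVERY_STARTUP (6u << 8)   /* delivery mode 110b */
#define ICR_LEVEL_ASSERT     (1u << 14)

/* ICR high dword: destination APIC ID in bits 24-31. */
uint32_t icr_high(uint8_t apic_id) { return (uint32_t)apic_id << 24; }

/* ICR low dword for an INIT IPI. */
uint32_t icr_init(void) { return ICR_DELIVERY_INIT | ICR_LEVEL_ASSERT; }

/* ICR low dword for a STARTUP IPI. The vector is the 4 KiB page number
 * of the real-mode trampoline, so the trampoline must be page-aligned
 * and below 1 MiB. Returns 0 if the address cannot be used. */
uint32_t icr_sipi(uint32_t trampoline_phys)
{
    if ((trampoline_phys & 0xFFFu) != 0 || trampoline_phys >= 0x100000u)
        return 0;
    return ICR_DELIVERY_STARTUP | ICR_LEVEL_ASSERT | (trampoline_phys >> 12);
}

/* Hypothetical platform hooks. write_icr must write the high dword
 * before the low dword, since writing the low dword sends the IPI. */
typedef void (*icr_write_fn)(uint32_t high, uint32_t low);
typedef void (*delay_us_fn)(unsigned us);

/* INIT-SIPI-SIPI sequence to wake one application processor.
 * Returns -1 (without touching the APIC) if the trampoline is unusable. */
int wake_ap(uint8_t apic_id, uint32_t trampoline_phys,
            icr_write_fn write_icr, delay_us_fn delay_us)
{
    uint32_t sipi = icr_sipi(trampoline_phys);
    if (sipi == 0)
        return -1;
    write_icr(icr_high(apic_id), icr_init());
    delay_us(10000);                  /* ~10 ms after INIT */
    write_icr(icr_high(apic_id), sipi);
    delay_us(200);                    /* ~200 us before the second SIPI */
    write_icr(icr_high(apic_id), sipi);
    return 0;
}
```

For a trampoline at physical 0x8000, icr_sipi yields 0x4608, and the
INIT value is 0x4500.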

Assuming this does not cause too much contention: from there, since
not every processor bank may be physically connected to RAM, we have
to know which processor banks are not, and attach each of them to a
bank that is, which will cause a bit of contention.
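
Attaching a memoryless processor bank to one that has RAM can be done
by picking, for each such node, the nearest node with memory according
to a distance table (on x86, firmware can report node distances in the
ACPI SLIT, where 10 means local). A minimal sketch, assuming an
invented four-node topology in which nodes 1 and 3 have no RAM:

```c
#define NNODES 4

/* Hypothetical SLIT-style distance matrix: 10 = local, larger = farther.
 * Values are invented for illustration. */
static const int distance[NNODES][NNODES] = {
    { 10, 20, 20, 30 },
    { 20, 10, 30, 20 },
    { 20, 30, 10, 20 },
    { 30, 20, 20, 10 },
};

/* Which nodes have RAM behind their memory controller
 * (hypothetical: nodes 1 and 3 are memoryless). */
static const int has_memory[NNODES] = { 1, 0, 1, 0 };

/* Return the node whose memory this node should allocate from: itself
 * if it has RAM, otherwise the nearest node that does (lowest index
 * wins ties). Returns -1 if no node has memory at all. */
int home_memory_node(int node)
{
    if (has_memory[node])
        return node;
    int best = -1;
    for (int n = 0; n < NNODES; n++) {
        if (!has_memory[n])
            continue;
        if (best < 0 || distance[node][n] < distance[node][best])
            best = n;
    }
    return best;
}
```

Here node 1 falls back to node 0 (distance 20) and node 3 falls back
to node 2 (distance 20); all of their memory traffic is remote, which
is the contention mentioned above.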

Then there's the question of how the CPUs in the other NUMA regions
will continue to execute kernel code: do they all access the kernel
in the original BSP's local memory? Or do you find a way to copy the
kernel to each individual memory bank, so that the processors in each
NUMA region can execute the kernel locally without contending for the
copy in the first NUMA region's local memory?

Then, if you did need to copy the kernel to each memory region, how
would you have to link the kernel to make this work? Forgive me if
this question seems particularly newbie-ish, but:

If I create a binary that is linked at physical 0x100000 (1MB) and
virtual 0xC0000000 (3GB), and then copy it into the address space of
NUMA region 2 so that those processors have a copy of the kernel in
local memory, would that copy (here comes the key question) be able
to execute normally, given that NUMA region 2 could very well start
at some configured physical address, say... 0x80000000?

That is to say, in a ccNUMA architecture, all of the processors see
all RAM on the motherboard as one logically contiguous sequence of
bytes, from 0x0 up to the highest physical address, even though two
or more separate memory regions may be mapped into that span.

So to clarify the question above about physical linking and copying
to another NUMA region's local memory: I am asking whether copying an
image linked at physical address 1MB to a different physical address
will prevent it from executing properly.
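
The question turns on the difference between the virtual address an
image is linked at and the physical address it is loaded at. With
paging enabled, code linked at virtual 0xC0000000 only uses virtual
addresses, so in general a copy at another physical base can run
unchanged if that node's page tables map 0xC0000000 onto the copy;
code that hard-codes physical addresses (such as early boot code
running before paging is on) would not. A minimal sketch of that
translation, assuming a hypothetical kernel_copy record and a simple
linear mapping standing in for real page tables:

```c
#include <stdint.h>

/* Virtual link address from the question. */
#define KERNEL_VIRT 0xC0000000u

/* Hypothetical per-node record: where this node's copy of the kernel
 * image was placed in physical memory. */
struct kernel_copy {
    uint32_t phys_base;  /* page-aligned physical load address */
    uint32_t size;       /* image size in bytes */
};

/* Translate a kernel virtual address to the physical address inside
 * this node's copy. The virtual address (and therefore every absolute
 * address the linker baked into the code) is the same for every copy;
 * only the physical target differs. Returns -1 if vaddr is outside
 * the image. */
int virt_to_phys(const struct kernel_copy *k, uint32_t vaddr, uint32_t *phys)
{
    if (vaddr < KERNEL_VIRT || vaddr - KERNEL_VIRT >= k->size)
        return -1;
    *phys = k->phys_base + (vaddr - KERNEL_VIRT);
    return 0;
}
```

With one copy at 0x100000 and another at 0x80000000, the same virtual
address 0xC0001234 resolves to 0x101234 on one node and 0x80001234 on
the other. Writable kernel data shared between CPUs would still have
to live at a single physical location.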

I look forward eagerly to your replies.

Alexei A. Frounze

Dec 23, 2009, 3:12:53 AM

I think the systems are implemented in such a way that every CPU can
access all memory. Accessing memory from remote nodes will incur a
certain overhead, though. Another thing to consider: these CPUs
don't each run their own local copy of the OS, shielded from the other copies.
They work together in the same OS and at certain times they have to
access kernel (and other) data structures to share data between them
and execute synchronization code with locked instructions (e.g.
CMPXCHG in spinlocks). There's gotta be a way to share the RAM, even
if non-uniformly performance-wise.
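
The locked-instruction point can be illustrated with a minimal
test-and-test-and-set spinlock written with C11 atomics, assuming a
C11 compiler; on x86, compilers typically compile the compare-exchange
below to LOCK CMPXCHG, which works on any physical address regardless
of which node's memory controller owns it, just more slowly when the
cache line is remote.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Spinlock built on compare-and-swap. */
struct spinlock {
    atomic_int locked;  /* 0 = free, 1 = held */
};

void spin_init(struct spinlock *l) { atomic_init(&l->locked, 0); }

/* Try once to take the lock: atomically swap 0 -> 1. Returns true on success. */
bool spin_trylock(struct spinlock *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->locked, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

void spin_lock(struct spinlock *l)
{
    while (!spin_trylock(l)) {
        /* Wait with plain loads so waiters don't keep issuing locked
         * writes to a cache line that may live on a remote node. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
    }
}

void spin_unlock(struct spinlock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Spinning on a plain load and retrying the CAS only when the lock looks
free keeps most of the waiting inside each CPU's own cache, which
matters more when some of the waiters sit on another node.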

Alex
