
[patch] Simple Topology API


Matthew Dobson

Jul 12, 2002, 20:40:04
Here is a very rudimentary topology API for NUMA systems. It uses prctl() for
the userland calls, and exposes some useful things to userland. It would be
nice to expose these simple structures to both users and the kernel itself.
Any architecture wishing to use this API simply has to write a .h file that
defines the 5 calls defined in core_ibmnumaq.h and include it in asm/mmzone.h.
Voila! Instant inclusion in the topology!
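[Editorial sketch: core_ibmnumaq.h itself is not included in this excerpt. As an illustration only, the five per-architecture calls could look like the trivial identity mapping Matt later suggests for flat machines; the names follow that later message, and the `static inline` signatures are an assumption, not the actual patch.]

```c
/* Hypothetical arch header sketch (NOT the real core_ibmnumaq.h):
 * the five topology calls an architecture must supply, shown with the
 * identity mapping a flat machine (one CPU and one memory block per
 * node) could use. */
static inline int _cpu_to_node(int cpu)       { return cpu; }
static inline int _memblk_to_node(int memblk) { return memblk; }
static inline int _node_to_node(int node)     { return node; }
static inline int _node_to_cpu(int node)      { return node; }
static inline int _node_to_memblk(int node)   { return node; }
```

With this mapping every translation is its own inverse, so e.g. `_memblk_to_node(_node_to_memblk(n))` gives back `n`.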

Enjoy!

-Matt

2.5.25-simple_topo.patch

Andrew Morton

Jul 12, 2002, 22:50:04
Matthew Dobson wrote:
>
> Here is a very rudimentary topology API for NUMA systems. It uses prctl() for
> the userland calls, and exposes some useful things to userland. It would be
> nice to expose these simple structures to both users and the kernel itself.
> Any architecture wishing to use this API simply has to write a .h file that
> defines the 5 calls defined in core_ibmnumaq.h and include it in asm/mmzone.h.

Matt,

I suspect what happens when these patches come out is that most people simply
don't have the knowledge/time/experience/context to judge them, and nothing
ends up happening. No way would I pretend to be able to comment on the
big picture, that's for sure.

If the code is clean, the interfaces make sense, the impact on other
platforms is minimised and the stakeholders are OK with it then that
should be sufficient, yes?

AFAIK, the interested parties with this and the memory binding API are
ia32-NUMA, ia64, PPC, some MIPS and x86-64-soon. It would be helpful
if the owners of those platforms could review this work and say "yes,
this is something we can use and build upon". Have they done that?


I'd have a few micro-observations:

> ...
> --- linux-2.5.25-vanilla/kernel/membind.c Wed Dec 31 16:00:00 1969
> +++ linux-2.5.25-api/kernel/membind.c Fri Jul 12 16:13:17 2002
> ..
> +inline int memblk_to_node(int memblk)

The inlines with global scope in this file seem strange?


Matthew Dobson wrote:
>
> Here is a Memory Binding API
> ...
> + memblk_binding: { MEMBLK_NO_BINDING, MPOL_STRICT }, \

> ...
> +typedef struct memblk_list {
> + memblk_bitmask_t bitmask;
> + int behavior;
> + rwlock_t lock;
> +} memblk_list_t;

Is it possible to reduce this type to something smaller for
CONFIG_NUMA=n?

In the above task_struct initialiser you should initialise the
rwlock to RWLOCK_LOCK_UNLOCKED.

It's nice to use the `name:value' initialiser format in there, too.


> ...
> +int set_memblk_binding(memblk_bitmask_t memblks, int behavior)
> +{
> ...
> + read_lock_irqsave(&current->memblk_binding.lock, flags);

Your code accesses `current' a lot. You'll find that the code
generation is fairly poor - evaluating `current' chews 10-15
bytes of code. You can perform a manual CSE by copying current
into a local, and save a few cycles.
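[Editorial sketch: Andrew's point can be made visible outside the kernel. In this toy version (all names made up, not from the patch), a counting `get_current()` stands in for the kernel's `current` macro, which re-derives the task pointer on every use:]

```c
/* Userspace illustration of the manual CSE Andrew suggests. */
struct task { int nvisits; };

static struct task me;
static int current_evals;   /* how many times "current" was computed */

static struct task *get_current(void) { current_evals++; return &me; }
#define current (get_current())

static void bind_naive(void)
{
    /* three uses of `current`, three evaluations */
    current->nvisits++;
    current->nvisits++;
    current->nvisits++;
}

static void bind_cse(void)
{
    struct task *tsk = current;   /* evaluate once, reuse the local */

    tsk->nvisits++;
    tsk->nvisits++;
    tsk->nvisits++;
}
```

The second form evaluates `current` once instead of three times; in the kernel that removes the repeated 10-15 bytes of pointer derivation.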

> ...
> +struct page * _alloc_pages(unsigned int gfp_mask, unsigned int order)
> +{
> ...
> + spin_lock_irqsave(&node_lock, flags);
> + temp = pgdat_list;
> + spin_unlock_irqrestore(&node_lock, flags);

Not sure what you're trying to lock here, but you're not locking
it ;) This is either racy code or unneeded locking.


Thanks.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Alexander Viro

Jul 13, 2002, 04:10:07

It's hard to enjoy the use of prctl(). Especially for things like
"give me the number of the first CPU in node <n>" - it ain't no
process control, no matter how you stretch it.

<soapbox> That's yet another demonstration of the evil of multiplexing
syscalls. They hide the broken APIs and make them easy to introduce.
And broken APIs get introduced - through each of these. prctl(), fcntl(),
ioctl() - you name it. Please, don't do that. </soapbox>

Please, replace that API with something sane. "Current processor" and
_maybe_ "current node" are reasonable per-process things (even though
the latter is obviously redundant). They are inherently racy, however -
if you get rescheduled on the return from the syscall, the value may have
nothing to do with reality by the time you return to userland. The rest is
obviously system-wide _and_ not process-related (it's "tell me about
the configuration of the machine"). Implementing them as prctls makes
absolutely no sense. If anything, that's sysctl material.

Albert D. Cahalan

Jul 13, 2002, 13:20:06
Alexander Viro writes:

> It's hard to enjoy the use of prctl(). Especially for things like
> "give me the number of the first CPU in node <n>" - it ain't no
> process control, no matter how you stretch it.

Yeah... eeew.

> <soapbox> That's yet another demonstration of the evil of multiplexing
> syscalls. They hide the broken APIs and make them easy to introduce.
> And broken APIs get introduced - through each of these. prctl(), fcntl(),
> ioctl() - you name it. Please, don't do that. </soapbox>

This wouldn't happen if it wasn't so damn hard to add a syscall.
If you make people go through all the arch maintainers just to
add a simple arch-independent syscall, they'll just bolt their
code into some dark hidden corner of the kernel. That's life.
Make syscalls easy to write, and this won't happen.

Can you guess what would happen if you got rid of prctl(),
fcntl(), and ioctl()? We'd get apps with code like this:

// write address of one of these to /proc/orifice
typedef struct evil {
    int version;            // struct version
    struct evil *next;      // next in list
    struct evil *prev;      // prev in list
    char opcode;            // indicates what we will do
    int (*fn)(void *);      // callback function (if not NULL)
    void *addr;             // an address in kernel memory
    short flags;            // 0x0001 call fn w/ ints off, 0x0002 w/ BKL
    double timeout;         // in microfortnights (uses APIC's NMI)
} evil;

Andi Kleen

Jul 13, 2002, 16:20:04
Andrew Morton <ak...@zip.com.au> writes:


> AFAIK, the interested parties with this and the memory binding API are
> ia32-NUMA, ia64, PPC, some MIPS and x86-64-soon. It would be helpful
> if the owners of those platforms could review this work and say "yes,
> this is something we can use and build upon". Have they done that?

Comment from the x86-64 side:

Current x86-64 NUMA essentially has no 'nodes', just each CPU has
local memory that is slightly faster than remote memory. This means
the node number would always be identical to the CPU number. As long
as the API allows for that, it's ok for me. Just the node concept will not be
very useful on that platform. memblk will also be identity mapped to
node/cpu.

Some way to tell user space about memory affinity seems to be useful,
but...

General comment:

I don't see what the application should do with the memblk concept
currently. Just knowing about it doesn't seem too useful.
Surely it needs some way to allocate memory in a specific memblk to be useful?
Also doesn't it need to know how much memory is available in each memblk?
(otherwise I don't see how it could do any useful partitioning)

-Andi

Linus Torvalds

Jul 14, 2002, 15:20:09

[ I've been off-line for a week, so I didn't follow all of the discussion,
but here goes anyway ]

On 13 Jul 2002, Andi Kleen wrote:
>
> Current x86-64 NUMA essentially has no 'nodes', just each CPU has
> local memory that is slightly faster than remote memory. This means
> the node number would always be identical to the CPU number. As long
> as the API allows for that, it's ok for me. Just the node concept will not be
> very useful on that platform. memblk will also be identity mapped to
> node/cpu.

The whole "node" concept sounds broken. There is no such thing as a node,
since even within nodes latencies will easily differ for different CPU's
if you have local memories for CPU's within a node (which is clearly the
only sane thing to do).

If you want to model memory behaviour, you should have memory descriptors
(in linux parlance, "zone_t") have an array of latencies to each CPU. That
latency is _not_ a "is this memory local to this CPU" kind of number, that
simply doesn't make any sense. The fact is, what matters is the number of
hops. Maybe you want to allow one hop, but not five.

Then, make the memory binding interface a function of just what kind of
latency you allow from a set X of CPU's. Simple, straightforward, and it
has a direct meaning in real life, which makes it unambiguous.

So your "memory affinity" system call really needs just one number: the
acceptable latency. You may also want to have a CPU-set argument, although
I suspect that it's equally correct to just assume that the CPU-set is the
set of CPU's that the process can already run on.

After that, creating a new zone array is nothing more than:

- give each zone a "latency value", which is simply the minimum of all
the latencies for that zone from CPU's that are in the CPU set.

- sort the zone array, lowest latency first.

- the passed-in latency is the cut-off-point - clear the end of the
array (with the sanity check that you always accept one zone, even if
it happens to have a latency higher than the one passed in).

End result: you end up with a priority-sorted array of acceptable zones.
In other words, a zone list. Which is _exactly_ what you want anyway
(that's what the current "zone_table" is).

And then you associate that zone-list with the process, and use that
zone-list for all process allocations.
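[Editorial sketch: the three steps above can be written down directly. This is my illustration, not code from the thread; the types (`struct zone_info`, `build_zonelist`) are made up, not the kernel's real zone_t/zonelist. Each zone gets the minimum latency over the allowed CPU set, zones are sorted by that latency, and everything above the cutoff is dropped while always keeping the best zone.]

```c
#include <limits.h>

#define MAX_ZONES 8
#define MAX_CPUS  8

struct zone_info {
    int id;
    int latency_from[MAX_CPUS];  /* latency from each CPU to this zone */
};

/* Fills `list` with zone indices, lowest latency first; returns the
 * number of zones kept after the latency cutoff. */
static int build_zonelist(const struct zone_info *zones, int nzones,
                          unsigned long cpu_set, int max_latency, int *list)
{
    int lat[MAX_ZONES], n = 0, i, j;

    for (i = 0; i < nzones; i++) {
        int best = INT_MAX, cpu;

        /* minimum latency from any CPU in the allowed set */
        for (cpu = 0; cpu < MAX_CPUS; cpu++)
            if ((cpu_set & (1UL << cpu)) &&
                zones[i].latency_from[cpu] < best)
                best = zones[i].latency_from[cpu];
        lat[i] = best;
        list[n++] = i;
    }
    /* insertion sort, lowest latency first */
    for (i = 1; i < n; i++)
        for (j = i; j > 0 && lat[list[j]] < lat[list[j - 1]]; j--) {
            int t = list[j];
            list[j] = list[j - 1];
            list[j - 1] = t;
        }
    /* cut off above max_latency, but always accept the best zone */
    for (i = 1; i < n; i++)
        if (lat[list[i]] > max_latency)
            return i;
    return n;
}

/* Example: three zones as seen from CPU 0, in made-up relative units
 * (local = 10, same module = 15, across the backbone = 35). */
static const struct zone_info example_zones[3] = {
    { 0, { 10 } },
    { 1, { 35 } },
    { 2, { 15 } },
};
```

With a cutoff of 20 and only CPU 0 allowed, the resulting list is zone 0 then zone 2, and the slow zone 1 is dropped.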

Advantages:

- very direct mapping to what the hardware actually does

- no complex data structures for topology

- works for all topologies, the process doesn't even have to know, you
can trivially encode it all internally in the kernel by just having the
CPU latency map for each memory zone we know about.

Disadvantages:

- you cannot create "crazy" memory bindings. You can only say "I don't
want to allocate from slow memory". You _can_ do crazy things by
initially using a different CPU binding, then doing the memory
binding, and then re-doing the CPU binding. So if you _want_ bad memory
bindings you can create them, but you have to work at it.

- we have to use some standard latency measure, either purely time-based
(which changes from machine to machine), or based on some notion of
"relative to local memory".

My personal suggestion would be the "relative to local memory" thing, and
call that 10 units. So a cross-CPU (but same module) hop might imply a
latency of 15, while a memory access that goes over the backbone between
modules might be a 35. And one that takes two hops might be 55.

So then, for each CPU in a machine, you can _trivially_ create the mapping
from each memory zone to that CPU. And that's all you really care about.

No?

Linus

Andi Kleen

Jul 14, 2002, 15:50:09
On Sun, Jul 14, 2002 at 12:17:25PM -0700, Linus Torvalds wrote:
> The whole "node" concept sounds broken. There is no such thing as a node,
> since even within nodes latencies will easily differ for different CPU's
> if you have local memories for CPU's within a node (which is clearly the
> only sane thing to do).

I basically agree, but then when you go for a full graph everything
becomes very complex. It's not clear if that much detail is useful
for the application.

> latency is _not_ a "is this memory local to this CPU" kind of number, that
> simply doesn't make any sense. The fact is, what matters is the number of
> hops. Maybe you want to allow one hop, but not five.
>
> Then, make the memory binding interface a function of just what kind of
> latency you allow from a set X of CPU's. Simple, straightforward, and it
> has a direct meaning in real life, which makes it unambiguous.

Hmm - that could be a problem for applications that care less about
latency, but more about equal use of bandwidth (see below).
They just want their datastructures to be spread out evenly over
all the available memory controllers. I don't see how that could be
done with a single latency value; you really need some more complete
idea about the topology.

At least on Hammer the latency difference is small enough that
caring about the overall bandwidth makes more sense.

> And then you associate that zone-list with the process, and use that
> zone-list for all process allocations.

That's the basic idea sure for normal allocations from applications
that do not care much about NUMA.

But "numa aware" applications want to do other things like:
- put some memory area into every node (e.g. for the numa equivalent of
per CPU data in the kernel)
- "stripe" a shared memory segment over all available memory subsystems
(e.g. to use memory bandwidth fully if you know your interconnect can
take it; that's e.g. the case on the Hammer)

As I understood it this API is supposed to be the base of such an
NUMA API for applications (just offer the information, but no way
to use it usefully yet)
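[Editorial sketch: the striping idea Andi describes has a very small core. The function below is illustrative only; the posted API has no such call, and the name is made up. It just spreads the pages of a segment round-robin across all memory blocks:]

```c
/* Round-robin striping policy: pick the memory block for page number
 * `pageno` of a segment, cycling through all `nr_memblks` blocks so
 * every memory controller carries an equal share of the traffic. */
static inline int stripe_page_to_memblk(unsigned long pageno, int nr_memblks)
{
    return (int)(pageno % (unsigned long)nr_memblks);
}
```

Eight pages over four memory blocks land as 0, 1, 2, 3, 0, 1, 2, 3 — useful exactly when, as on Hammer, the interconnect has bandwidth to spare.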

More comments from the NUMA gurus please.

-Andi

Eric W. Biederman

Jul 14, 2002, 22:50:05
Andi Kleen <a...@suse.de> writes:
>
> At least on Hammer the latency difference is small enough that
> caring about the overall bandwidth makes more sense.

I agree. I will have to look closer but unless there is more
juice than I have seen in Hyper-Transport it is going to become
one of the architectural bottlenecks of the Hammer.

Currently you get 1600MB/s in a single direction. Not too bad.
But when the memory controllers get out to dual channel DDR-II 400,
the local bandwidth to that memory is 6400MB/s, and the bandwidth to
remote memory 1600MB/s, or 3200MB/s (if reads are as common as
writes).

So I suspect bandwidth intensive applications will really benefit
from local memory optimization on the Hammer. I can buy that the
latency is negligible, the fact the links don't appear to scale
in bandwidth as well as the connection to memory may be a bigger
issue.

> > And then you associate that zone-list with the process, and use that
> > zone-list for all process allocations.
>
> That's the basic idea sure for normal allocations from applications
> that do not care much about NUMA.
>
> But "numa aware" applications want to do other things like:
> - put some memory area into every node (e.g. for the numa equivalent of
> per CPU data in the kernel)
> - "stripe" a shared memory segment over all available memory subsystems
> (e.g. to use memory bandwidth fully if you know your interconnect can
> take it; that's e.g. the case on the Hammer)

The latter I really quite believe. Even dual channel PC2100 can
exceed your interprocessor bandwidth.

And yes I have measured 2000MB/s memory copy with an Athlon MP and
PC2100 memory.

Eric

Sandy Harris

Jul 15, 2002, 12:30:06
"Eric W. Biederman" wrote:
>
> Andi Kleen <a...@suse.de> writes:
> >
> > At least on Hammer the latency difference is small enough that
> > caring about the overall bandwidth makes more sense.
>
> I agree. I will have to look closer but unless there is more
> juice than I have seen in Hyper-Transport it is going to become
> one of the architectural bottlenecks of the Hammer.
>
> Currently you get 1600MB/s in a single direction.

That's on an 8-bit channel, as used on Clawhammer (AMD's lower cost
CPU for desktop market). The spec allows 2, 4, 6, 16 or 32-bit
channels. If I recall correctly, the AMD presentation at OLS said
Sledgehammer (server market) uses 16-bit.

> Not too bad.
> But when the memory controllers get out to dual channel DDR-II 400,
> the local bandwidth to that memory is 6400MB/s, and the bandwidth to
> remote memory 1600MB/s, or 3200MB/s (if reads are as common as
> writes).
>
> So I suspect bandwidth intensive applications will really benefit
> from local memory optimization on the Hammer. I can buy that the
> latency is negligible,

I'm not so sure. Clawhammer has two links, can do dual-CPU. One link
to the other CPU, one for I/O. Latency may well be negligible there.

Sledgehammer has three links, can do no-glue 4-way with each CPU
using two links to talk to others, one for I/O.

I/O -- A ------ B -- I/O
       |        |
       |        |
I/O -- C ------ D -- I/O

They can also go to no-glue 8-way:

I/O -- A ------ B ------ E ------ G -- I/O
       |        |        |        |
       |        |        |        |
I/O -- C ------ D ------ F ------ H -- I/O

I suspect latency may become an issue when more than one link is
involved and there can be contention.

Beyond 8-way, you need glue logic (hypertransport switches?) and
latency seems bound to become an issue.

> the fact the links don't appear to scale
> in bandwidth as well as the connection to memory may be a bigger
> issue.

Chris Friesen

Jul 15, 2002, 12:40:10
Sandy Harris wrote:

> I suspect latency may become an issue when more than one link is
> involved and there can be contention.

According to the AMD talk at OLS, worst case on a 4-way is better than current
best-case on a uniprocessor athlon.

> Beyond 8-way, you need glue logic (hypertransport switches?) and
> latency seems bound to become an issue.

Nope. Just extend the ladder. Each cpu talks to three other entities, either
cpu or I/O. Can be extended arbitrarily until latencies are too high.

Chris

--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: cfri...@nortelnetworks.com

Matthew Dobson

Jul 15, 2002, 14:00:08
Andi Kleen wrote:
> Andrew Morton <ak...@zip.com.au> writes:
>>AFAIK, the interested parties with this and the memory binding API are
>>ia32-NUMA, ia64, PPC, some MIPS and x86-64-soon. It would be helpful
>>if the owners of those platforms could review this work and say "yes,
>>this is something we can use and build upon". Have they done that?
>
> Comment from the x86-64 side:
>
> Current x86-64 NUMA essentially has no 'nodes', just each CPU has
> local memory that is slightly faster than remote memory. This means
> the node number would always be identical to the CPU number. As long
> as the API allows for that, it's ok for me. Just the node concept will not be
> very useful on that platform. memblk will also be identity mapped to
> node/cpu.
>
> Some way to tell user space about memory affinity seems to be useful,
> but...
That shouldn't be a problem at all. Since each architecture is responsible for
defining the 5 main topology functions, you could do this:

#define _cpu_to_node(cpu) (cpu)
#define _memblk_to_node(memblk) (memblk)
#define _node_to_node(node) (node)
#define _node_to_cpu(node) (node)
#define _node_to_memblk(node) (node)

> General comment:
>
> I don't see what the application should do with the memblk concept
> currently. Just knowing about it doesn't seem too useful.
> Surely it needs some way to allocate memory in a specific memblk to be useful?
> Also doesn't it need to know how much memory is available in each memblk?
> (otherwise I don't see how it could do any useful partitioning)

For that, you need to look at the Memory Binding API that I sent out moments
after this patch... It builds on top of this infrastructure to allow binding
processes to individual memory blocks or groups of memory blocks.

Cheers!

-Matt

Matthew Dobson

Jul 15, 2002, 15:00:09
Andrew Morton wrote:
> Matt,
>
> I suspect what happens when these patches come out is that most people simply
> don't have the knowledge/time/experience/context to judge them, and nothing
> ends up happening. No way would I pretend to be able to comment on the
> big picture, that's for sure.
Absolutely correct. I know that most people here on LKML don't have 8, 16, 32,
or more CPU systems to test this code on, or for that matter, even care about
code designed for said systems. I'm lucky enough to get to work on such
machines, and I'm sure there are others out there (as evidenced by some of the
replies I've gotten) that do care. Also, there are publicly available NUMA
machines in the OSDL that people can use to "play" on large systems. I hope
that by seeing code and using these systems, some more people might get
interested in some of the interesting scalability issues that crop up with
these machines.

> If the code is clean, the interfaces make sense, the impact on other
> platforms is minimised and the stakeholders are OK with it then that
> should be sufficient, yes?

I would hope so. That's what I'm trying to establish! ;)

> AFAIK, the interested parties with this and the memory binding API are
> ia32-NUMA, ia64, PPC, some MIPS and x86-64-soon. It would be helpful
> if the owners of those platforms could review this work and say "yes,
> this is something we can use and build upon". Have they done that?

I've gotten some feedback from large systems people. I hope to get feedback
from anyone with large systems that could potentially use this kind of API, and
get a "this is great" or a "this sucks". I believe that bigger systems need
new ways to improve efficiency and scalability beyond what the kernel offers now.
I know I do...

> I'd have a few micro-observations:
>
>>...
>>--- linux-2.5.25-vanilla/kernel/membind.c Wed Dec 31 16:00:00 1969
>>+++ linux-2.5.25-api/kernel/membind.c Fri Jul 12 16:13:17 2002
>>..
>>+inline int memblk_to_node(int memblk)
>
>
> The inlines with global scope in this file seem strange?
>
>
> Matthew Dobson wrote:
>
>>Here is a Memory Binding API
>>...
>>+ memblk_binding: { MEMBLK_NO_BINDING, MPOL_STRICT }, \
>
>
>>...
>>+typedef struct memblk_list {
>>+ memblk_bitmask_t bitmask;
>>+ int behavior;
>>+ rwlock_t lock;
>>+} memblk_list_t;
>
>
> Is it possible to reduce this type to something smaller for
> CONFIG_NUMA=n?

Probably... I'll look at that today...

> In the above task_struct initialiser you should initialise the
> rwlock to RWLOCK_LOCK_UNLOCKED.

Yep.. Totally forgot about that! :(

> It's nice to use the `name:value' initialiser format in there, too.

Sure, enhanced readability is always a good thing!

>>...
>>+int set_memblk_binding(memblk_bitmask_t memblks, int behavior)
>>+{
>>...
>>+ read_lock_irqsave(&current->memblk_binding.lock, flags);
>
>
> Your code accesses `current' a lot. You'll find that the code
> generation is fairly poor - evaluating `current' chews 10-15
> bytes of code. You can perform a manual CSE by copying current
> into a local, and save a few cycles.

Sure.. I've actually gotten a couple different ideas about improving the
efficiency of that function, and will also be rewriting that today..

>>...
>>+struct page * _alloc_pages(unsigned int gfp_mask, unsigned int order)
>>+{
>>...
>>+ spin_lock_irqsave(&node_lock, flags);
>>+ temp = pgdat_list;
>>+ spin_unlock_irqrestore(&node_lock, flags);
>
>
> Not sure what you're trying to lock here, but you're not locking
> it ;) This is either racy code or unneeded locking.

To be honest, I'm not entirely sure what that's locking either. That is the
non-NUMA path of that function, and the locking was in the original code, so I
just moved it along. After doing a bit of searching, that lock seems
COMPLETELY useless there. Especially since in the original function, a few
lines further down pgdat_list is read again, without the lock! I guess, unless
someone here says otherwise, I'll pull that locking out of the next rev.

Thanks for all the feedback. I'll incorporate most of it into the next rev of
the patch!

Cheers!

-Matt

Jukka Honkela

Jul 15, 2002, 16:00:06

Chris Friesen wrote:

>> Beyond 8-way, you need glue logic (hypertransport switches?) and
>> latency seems bound to become an issue.

>Nope. Just extend the ladder. Each cpu talks to three other entities,
>either cpu or I/O. Can be extended arbitrarily until latencies are too
>high.

You seem to be missing one critical piece from the OLS talk. The HT
protocol (or something related) can't handle more than 8 CPU's in a single
configuration. You need to have some kind of bridge to connect
more than 8 CPU's together, although systems with more than 8 CPU's have
not been discussed officially anywhere, afaik.

8 CPU's and less belongs to the SUMO category (Sufficiently Uniform Memory
Organization, apparently new AMD terminology) whereas 9 CPU's and more is
likely to be NUMA.

--
Jukka Honkela

Matthew Dobson

Jul 15, 2002, 20:00:10
Al,

If I can get 1-2 syscalls for the Topo API, and 1-2 for the Membind API, I'll
gladly make the changes. For now, though, prctl() works fine. If it needs to
be changed at some point, it can be done in about 5 minutes...

As far as the raciness of the get_curr_cpu & get_curr_node calls, that is noted
in the comments. Until we get a better way of exposing the current working
processor to userspace, they'll have to do. I believe that having *some* idea
of where you're running is better than having *no* idea of where you're running.

-Matt

Eric W. Biederman

Jul 16, 2002, 06:50:08
Sandy Harris <pas...@storm.ca> writes:

> "Eric W. Biederman" wrote:
> >
> > Andi Kleen <a...@suse.de> writes:
> > >
> > > At least on Hammer the latency difference is small enough that
> > > caring about the overall bandwidth makes more sense.
> >
> > I agree. I will have to look closer but unless there is more
> > juice than I have seen in Hyper-Transport it is going to become
> > one of the architectural bottlenecks of the Hammer.
> >
> > Currently you get 1600MB/s in a single direction.
>
> That's on an 8-bit channel, as used on Clawhammer (AMD's lower cost
> CPU for desktop market). The spec allows 2, 4, 6, 16 or 32-bit
> channels. If I recall correctly, the AMD presentation at OLS said
> Sledgehammer (server market) uses 16-bit.

Thanks, my confusion. The danger of having more bandwidth to memory
than to other processors is still present, but it may be one of those
places where the cpu designers are able to stay one step ahead of
the problem. I will definitely agree the problem goes away for the
short term with a 32bit link.


> > Not to bad.
> > But when the memory controllers get out to dual channel DDR-II 400,
> > the local bandwidth to that memory is 6400MB/s, and the bandwidth to
> > remote memory 1600MB/s, or 3200MB/s (if reads are as common as
> > writes).
> >
> > So I suspect bandwidth intensive applications will really benefit
> > from local memory optimization on the Hammer. I can buy that the
> > latency is negligible,
>
> I'm not so sure. Clawhammer has two links, can do dual-CPU. One link
> to the other CPU, one for I/O. Latency may well be negligible there.
>
> Sledgehammer has three links, can do no-glue 4-way with each CPU
> using two links to talk to others, one for I/O.
>
> I/O -- A ------ B -- I/O
>        |        |
>        |        |
> I/O -- C ------ D -- I/O
>
> They can also go to no-glue 8-way:
>
> I/O -- A ------ B ------ E ------ G -- I/O
>        |        |        |        |
>        |        |        |        |
> I/O -- C ------ D ------ F ------ H -- I/O

> I suspect latency may become an issue when more than one link is
> involved and there can be contention.

I think the 8-way topology is a little more interesting than
presented. But if not it does look like you can run into issues.
The more I look at it there appears to be a strong dynamic balance
in the architecture between having just enough bandwidth, and low
enough latency not to become a bottleneck, and having a low hardware
cost.

> Beyond 8-way, you need glue logic (hypertransport switches?) and
> latency seems bound to become an issue.

Beyond 8-way you get into another system architecture entirely, which
should be considered on its own merits. In large part cache
directories and other very sophisticated techniques are needed when
you scale a system beyond the SMP point. As long as the inter-cpu
bandwidth is >= the memory bandwidth on a single memory controller
Hammer can probably get away with being just a better SMP, and not
really a NUMA design.

Eric

Rik van Riel

Jul 16, 2002, 09:10:06
On 16 Jul 2002, Eric W. Biederman wrote:
> Sandy Harris <pas...@storm.ca> writes:

> > I/O -- A ------ B ------ E ------ G -- I/O
> >        |        |        |        |
> >        |        |        |        |
> > I/O -- C ------ D ------ F ------ H -- I/O
>
> > I suspect latency may become an issue when more than one link is
> > involved and there can be contention.
>
> I think the 8-way topology is a little more interesting than
> presented. But if not it does look like you can run into issues.

IIRC

I/O -- A ------ B ------ E ------ G -- I/O
       |         \      /         |
       |          \    /          |
       |            XX            |
       |          /    \          |
       |         /      \         |
I/O -- C ------ D ------ F ------ H -- I/O

Where B is connected to F and D to E. Obviously this
setup has a maximum hop count of 3 from any cpu to any
other cpu, as opposed to a maximum hop count of 4 for
the simple ladder.
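[Editorial sketch: the hop counts can be checked mechanically. The following userspace C is my illustration, not code from the thread: it builds both 8-way ladders as adjacency matrices (CPUs A..H as nodes 0..7) and computes the worst-case CPU-to-CPU hop count by breadth-first search.]

```c
#include <string.h>

#define N 8   /* CPUs A..H are nodes 0..7 */

static void link_nodes(int adj[N][N], int a, int b)
{
    adj[a][b] = adj[b][a] = 1;
}

/* Build the 8-way ladder; `crossed` replaces the two inner rungs
 * (B-D, E-F) with the diagonal B-F and D-E links described above. */
static void build_ladder(int adj[N][N], int crossed)
{
    /* top row A-B-E-G, bottom row C-D-F-H */
    static const int rows[6][2] = {
        {0, 1}, {1, 4}, {4, 6},   /* A-B, B-E, E-G */
        {2, 3}, {3, 5}, {5, 7},   /* C-D, D-F, F-H */
    };
    int i;

    memset(adj, 0, sizeof(int) * N * N);
    for (i = 0; i < 6; i++)
        link_nodes(adj, rows[i][0], rows[i][1]);
    link_nodes(adj, 0, 2);        /* A-C rung */
    link_nodes(adj, 6, 7);        /* G-H rung */
    if (!crossed) {
        link_nodes(adj, 1, 3);    /* B-D rung */
        link_nodes(adj, 4, 5);    /* E-F rung */
    } else {
        link_nodes(adj, 1, 5);    /* B-F diagonal */
        link_nodes(adj, 3, 4);    /* D-E diagonal */
    }
}

/* Worst-case hop count between any two CPUs (graph diameter), by BFS
 * from every source node. */
static int diameter(int adj[N][N])
{
    int worst = 0, src;

    for (src = 0; src < N; src++) {
        int dist[N], queue[N], head = 0, tail = 0, i;

        memset(dist, -1, sizeof(dist));   /* -1 == unvisited */
        dist[src] = 0;
        queue[tail++] = src;
        while (head < tail) {
            int u = queue[head++];
            for (i = 0; i < N; i++)
                if (adj[u][i] && dist[i] < 0) {
                    dist[i] = dist[u] + 1;
                    queue[tail++] = i;
                }
        }
        for (i = 0; i < N; i++)
            if (dist[i] > worst)
                worst = dist[i];
    }
    return worst;
}
```

Running this confirms the claim: worst case 4 hops for the plain ladder, 3 with the B-F / D-E cross.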

regards,

Rik
--
http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid. Go buy yourself a real t-shirt"

http://www.surriel.com/ http://distro.conectiva.com/

Martin J. Bligh

Jul 16, 2002, 11:50:09

>> They can also go to no-glue 8-way:
>>
>> I/O -- A ------ B ------ E ------ G -- I/O
>>        |        |        |        |
>>        |        |        |        |
>> I/O -- C ------ D ------ F ------ H -- I/O
>
>
> I think the 8-way topology is a little more interesting than
> presented. But if not it does look like you can run into issues.
> The more I look at it there appears to be a strong dynamic balance
> in the architecture between having just enough bandwidth, and low
> enough latency not to become a bottleneck, and having a low hardware
> cost.

Whilst I don't have a definitive diagram, the "back of a napkin"
sketches we came up with at an OLS dinner looked like this:

I/O -- A ------ B ---- E ------ G -- I/O
       |           \/           |
       |           /\           |
I/O -- C ------ D ---- F ------ H -- I/O

(please excuse my poor artistic skills). That reduces the max
hops from 4 to 3 (if I haven't screwed something up).

M.

Martin J. Bligh

Jul 16, 2002, 15:10:07
> The whole "node" concept sounds broken. There is no such thing as a node,
> since even within nodes latencies will easily differ for different CPU's
> if you have local memories for CPU's within a node (which is clearly the
> only sane thing to do).

Define a node as a group of CPUs with the same set of latencies to memory.
Then you get something that makes sense for everyone, and reduces the
storage of duplicated data. If your latencies for each CPU are different,
define a 1-1 mapping between nodes and CPUs. If you really want to store
everything for each CPU, that's fine.
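[Editorial sketch: Martin's definition is easy to make concrete. This is my illustration (made-up names and a tiny fixed table, not thread code): given a per-CPU latency vector to each memory block, CPUs with identical vectors collapse into one node, and fully distinct vectors degenerate to the 1-1 CPU==node mapping.]

```c
#include <string.h>

#define NCPUS 4   /* example machine: 4 CPUs ... */
#define NMEM  2   /* ... and 2 memory blocks     */

/* Assigns each CPU a node id in cpu_node[]; CPUs sharing a latency
 * vector share a node.  Returns the number of nodes found. */
static int group_cpus_into_nodes(const int lat[NCPUS][NMEM],
                                 int cpu_node[NCPUS])
{
    int nnodes = 0, i, j;

    for (i = 0; i < NCPUS; i++) {
        cpu_node[i] = -1;
        for (j = 0; j < i; j++)
            if (!memcmp(lat[i], lat[j], sizeof(lat[i]))) {
                cpu_node[i] = cpu_node[j];   /* same vector: same node */
                break;
            }
        if (cpu_node[i] < 0)
            cpu_node[i] = nnodes++;          /* new latency vector: new node */
    }
    return nnodes;
}
```

On a NUMA-Q-style table where CPUs 0-1 are local to memory block 0 and CPUs 2-3 to block 1, this yields two nodes of two CPUs each, with no per-CPU duplication of the latency data.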

> If you want to model memory behaviour, you should have memory descriptors
> (in linux parlance, "zone_t") have an array of latencies to each CPU. That
> latency is _not_ a "is this memory local to this CPU" kind of number, that
> simply doesn't make any sense. The fact is, what matters is the number of
> hops. Maybe you want to allow one hop, but not five.

I can't help thinking that we'd be better off making the mechanism as generic
as possible, and not trying to predict all the weird and wonderful things people
might want to do (eg striping), then implement what you describe as a policy
decision.

M.

Matthew Dobson

ungelesen,
16.07.2002, 18:40:0716.07.02
an
Linus Torvalds wrote:
> [ I've been off-line for a week, so I didn't follow all of the discussion,
> but here goes anyway ]
>
> On 13 Jul 2002, Andi Kleen wrote:
>
>>Current x86-64 NUMA essentially has no 'nodes', just each CPU has
>>local memory that is slightly faster than remote memory. This means
>>the node number would always be identical to the CPU number. As long
>>as the API allows for that, it's ok for me. Just the node concept will not be
>>very useful on that platform. memblk will also be identity mapped to
>>node/cpu.
>
>
> The whole "node" concept sounds broken. There is no such thing as a node,
> since even within nodes latencies will easily differ for different CPU's
> if you have local memories for CPU's within a node (which is clearly the
> only sane thing to do).
If you're saying local memories for *each* CPU within a node, then no, that is
not the only sane thing to do. There are some architectures that do, and some
that do not. The Hammer architecture, to the best of my knowledge, has memory
hanging off of each CPU, however, NUMA-Q, the main one I work with, has local
memory for each group of 4 CPUs. If you're speaking only of node-local memory,
i.e. memory local to all the CPUs on the 'node', then all local CPUs should have
the same latency to that memory.

> If you want to model memory behaviour, you should have memory descriptors
> (in linux parlance, "zone_t") have an array of latencies to each CPU. That
> latency is _not_ a "is this memory local to this CPU" kind of number, that
> simply doesn't make any sense. The fact is, what matters is the number of
> hops. Maybe you want to allow one hop, but not five.
>
> Then, make the memory binding interface a function of just what kind of
> latency you allow from a set X of CPU's. Simple, straightforward, and it
> has a direct meaning in real life, which makes it unambiguous.

I mostly agree with you here, except I really do believe that we should use the
node abstraction. It adds little overhead, but buys us a good bit. Nodes,
according to the API, are defined on a per-arch basis, allowing for us to
sanely define nodes on our NUMA-Q hardware (node==4cpus), AMD people to sanely
define nodes on their hardware (node==cpu), and others to define nodes to
whatever they want. We will avoid redundant data in many cases, and in the
simplest case, this defaults to your node==cpu behavior anyway. If we do use
CPU-Mem latencies, the NUMA-Q platform (and I'm sure others) would only be able
to distinguish between local and remote CPUs, not individual remote CPUs.
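The per-arch node definitions described here might look like the following helpers (illustrative only, not code from the patch; the divisors just encode the two granularities named above):

```c
/*
 * Hypothetical per-arch cpu-to-node mappings, following the scheme where
 * each architecture chooses its own node granularity.
 */

/* NUMA-Q: a node is a quad of 4 CPUs sharing a block of local memory. */
static int numaq_cpu_to_node(int cpu)
{
	return cpu / 4;
}

/* Hammer-style: every CPU has its own local memory, so node == cpu. */
static int hammer_cpu_to_node(int cpu)
{
	return cpu;
}

/* CPUs belonging to a NUMA-Q node, as a bitmask. */
static unsigned int numaq_node_to_cpumask(int node)
{
	return 0xfu << (node * 4);
}
```

Generic code calls the same interface on both; only the arch header decides whether a node is one CPU or four.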

> So your "memory affinity" system call really needs just one number: the
> acceptable latency. You may also want to have a CPU-set argument, although
> I suspect that it's equally correct to just assume that the CPU-set is the
> set of CPU's that the process can already run on.
>
> After that, creating a new zone array is nothing more than:
>
> - give each zone a "latency value", which is simply the minimum of all
> the latencies for that zone from CPU's that are in the CPU set.
>
> - sort the zone array, lowest latency first.
>
> - the passed-in latency is the cut-off-point - clear the end of the
> array (with the sanity check that you always accept one zone, even if
> it happens to have a latency higher than the one passed in).
>
> End result: you end up with a priority-sorted array of acceptable zones.
> In other words, a zone list. Which is _exactly_ what you want anyway
> (that's what the current "zone_table" is).
>
> And then you associate that zone-list with the process, and use that
> zone-list for all process allocations.

It seems as though you'd be throwing out some useful data. For example,
imagine you have a 2 quad NUMAQ system. Each quad contains 4 CPUs and a block
of memory. Now if we use all of the CPUs as our CPU set, zone 0 (memory block on
quad 0) will have a latency of 1 (b/c it is one hop from the first 4 cpus), as
will zone 1 (memory block on quad 1), b/c it is one hop from the second 4 cpus.
Now it would appear that since these zones both have the same latency, they
would be equally good choices. This isn't true, since if the process is on CPUs
0-3, it should allocate on zone 0, and vice versa for CPUs 4-7. Latency
shouldn't be the ONLY way to make decisions.
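The two-quad counterexample can be made concrete. Below, `zone_latency()` computes the per-zone value from the quoted recipe (minimum latency over the CPU set); the latency numbers are hypothetical (10 = local, 35 = one backbone hop). Over the full 8-CPU set both zones score 10 and the sorted zone list cannot tell them apart, even though a process pinned to quad 0 clearly wants zone 0.

```c
#define NR_CPUS  8
#define NR_ZONES 2

/* Hypothetical 2-quad NUMA-Q: 10 = local memory, 35 = one backbone hop. */
static const int latency[NR_CPUS][NR_ZONES] = {
	{ 10, 35 }, { 10, 35 }, { 10, 35 }, { 10, 35 },	/* quad 0: CPUs 0-3 */
	{ 35, 10 }, { 35, 10 }, { 35, 10 }, { 35, 10 },	/* quad 1: CPUs 4-7 */
};

/*
 * Zone "latency value" per the quoted recipe: the minimum of the
 * latencies to this zone from every CPU in cpu_set (a bitmask).
 */
static int zone_latency(int zone, unsigned int cpu_set)
{
	int min = 1 << 30;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if ((cpu_set & (1u << cpu)) && latency[cpu][zone] < min)
			min = latency[cpu][zone];
	}
	return min;
}
```

With `cpu_set = 0xff` both zones come out at 10 and tie; only a narrower CPU set (e.g. `0x0f`, quad 0 only) separates them, which is the information a bare latency cutoff throws away.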

> Advantages:
>
> - very direct mapping to what the hardware actually does

True for some architectures, but not for others.

> - no complex data structures for topology

Agreed.

> - works for all topologies, the process doesn't even have to know, you
> can trivially encode it all internally in the kernel by just having the
> CPU latency map for each memory zone we know about.

True, but the point of this API is to allow processes that *DO* want to
know to make intelligent decisions! Those that don't care, can still go on,
blissfully unaware they are on a NUMA system.

> Disadvantages:
>
> - you cannot create "crazy" memory bindings. You can only say "I don't
> want to allocate from slow memory". You _can_ do crazy things by
> initially using a different CPU binding, then doing the memory
> binding, and then re-doing the CPU binding. So if you _want_ bad memory
> bindings you can create them, but you have to work at it.

Why limit the process? The overhead is so small to allow processes to do
anything they want, why not allow them?

> - we have to use some standard latency measure, either purely time-based
> (which changes from machine to machine), or based on some notion of
> "relative to local memory".
>
> My personal suggestion would be the "relative to local memory" thing, and
> call that 10 units. So a cross-CPU (but same module) hop might imply a
> latency of 15, which a memory access that goes over the backbone between
> modules might be a 35. And one that takes two hops might be 55.

Absolutely true. I think that the "relative to local memory" is a great
measuring stick. It is pretty much platform agnostic, assuming every platform
has some concept of "local" memory.

I basically think that we should give processes that care the ability to do
just about anything they want, no matter how crazy... Most processes will
never even attempt to look at their default bindings, never mind change them.
Plus, we're making mechanism decisions that will (hopefully) be around for some
time. I'm sure people will come up with things we can't even imagine, so the
more powerful the API the better.

<prepares to stop, drop, and roll>

-Matt

Michael Hohnbaum

16.07.2002, 20:30:09
On Sun, 2002-07-14 at 12:17, Linus Torvalds wrote:
>
> [ I've been off-line for a week, so I didn't follow all of the discussion,
> but here goes anyway ]
>
> On 13 Jul 2002, Andi Kleen wrote:
> >
> > Current x86-64 NUMA essentially has no 'nodes', just each CPU has
> > local memory that is slightly faster than remote memory. This means
> > the node number would be always identical to the CPU number. As long
> > as the API provides it's ok for me. Just the node concept will not be
> > very useful on that platform. memblk will also be identity mapped to
> > node/cpu.
>
> The whole "node" concept sounds broken. There is no such thing as a node,
> since even within nodes latencies will easily differ for different CPU's
> if you have local memories for CPU's within a node (which is clearly the
> only sane thing to do).
>
> If you want to model memory behaviour, you should have memory descriptors
> (in linux parlance, "zone_t") have an array of latencies to each CPU. That
> latency is _not_ a "is this memory local to this CPU" kind of number, that
> simply doesn't make any sense. The fact is, what matters is the number of
> hops. Maybe you want to allow one hop, but not five.

How NUMA binding APIs have been used successfully is for a group of
processes to all decide to "hang out" together in the same vicinity
so that they can optimize access to shared memory. In existing NUMA
systems, this has been deciding to execute on a subset of the nodes.
In a system with 4 nodes, on NUMAQ machines, the latency is either
local or remote. Thus if one sets a binding/latency argument such
that remote accesses are allowed, then all of the nodes are fair game,
otherwise only the local node is used. So a set of processes decides
to occupy two nodes. Using strictly a latency argument, there is no
way to specify this. One could use cpu binding to restrict the
processes to the two nodes, but not the memory - unless you now
associate cpus with memory/nodes and are back to maintaining
topology info.
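This shortcoming can be shown in a few lines. With the hypothetical NUMA-Q latencies below (local = 10, every remote node = 35, as seen from node 0), a latency cutoff can only ever admit the home node alone or all four nodes; no threshold value yields the two-node mask `0x3` that the cooperating processes actually want.

```c
#define NR_NODES 4

/* Hypothetical latencies from node 0's CPUs: local = 10, remote = 35. */
static const int lat_from_node0[NR_NODES] = { 10, 35, 35, 35 };

/*
 * The set of nodes (as a bitmask) whose memory a pure latency-cutoff
 * binding would admit for a process homed on node 0.
 */
static unsigned int admitted_by_cutoff(int cutoff)
{
	unsigned int mask = 0;

	for (int n = 0; n < NR_NODES; n++) {
		if (lat_from_node0[n] <= cutoff)
			mask |= 1u << n;
	}
	return mask;
}
```

Every cutoff below 35 gives `0x1` (home only) and every cutoff of 35 or more gives `0xf` (everything); an explicit memblk mask is the only way to express "nodes 0 and 1".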

Another shortcoming of latency based binding is if a process executes
on a node for awhile, then gets moved to another node. In that case
the best memory allocation depends on several factors. The first is
to determine where we measure latency from - the node where the process is
currently executing, or the node where it had been executing? From a scheduling
perspective we are playing with the idea of a home node. If a
process is dispatched off of the home node should memory allocations
come from the home node or the current node? If neither has available
memory should latency be considered from home or current?

The other way that memory binding has been used is for a large process,
typically a database, to want control over where it places memory and
data structures. It is not a latency issue, but rather one of how
the work is distributed across the system.

The memory binding that Matt proposed allows an architecture to
define what a memory block is, and the application to determine
how it wants to bind to the memory blocks. It is only intended
for use by applications that are aware of the complexities
involved, and these will have to be knowledgeable about the
systems that they are on. Hopefully, by default, Linux will
do the right things as far as memory placement for processes
that choose to leave it up to the system - which will be the
majority of apps. However, the apps that want to specify their
memory placement, want the ability to have explicit control
over the nodes it lands on. It is not strictly a latency based
decision.

Michael
--

Michael Hohnbaum 503-578-5486
hohn...@us.ibm.com T/L 775-5486
