
transactional memory hype!


Chris Thomasson

Aug 23, 2007, 11:51:29 PM
WTF are the chip vendors bogging down their chips' cache-coherency system
designs with transactional memory bullshit!? I read somewhere that Intel is
thinking about doing HTM for some upcoming designs. Why?


http://groups.google.com/group/comp.arch/search?hl=en&group=comp.arch&q=%22chris+thomasson%22+transactional+memory
(here are some of my worries!)


I guess the chip vendors want their coherency systems to be at one with the
live-locking inherent in anything TM!

:^/

Dmitriy Vyukov

Aug 24, 2007, 1:58:28 AM
On Aug 24, 07:51, "Chris Thomasson" <cris...@comcast.net> wrote:
> WTF are the chip vendors bogging down their chips cache-coherency system
> designs with transactional memory bullshit!? I read somewhere that Intel is
> thinking about doing HTM for some upcoming designs. Why?

It seems that relaxed consistency and high-performance distributed
parallel computing are synonyms.
You can see the design of the Google File System:
http://209.85.163.132/papers/gfs-sosp2003.pdf
They include relaxed consistency even at the file-system level to achieve
real distribution. GFS can work on clusters of thousands of machines
exactly because of relaxed consistency.

Dmitriy V'jukov

David Wragg

Aug 24, 2007, 3:55:53 AM
On Aug 24, 4:51 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> WTF are the chip vendors bogging down their chips cache-coherency system
> designs with transactional memory bullshit!? I read somewhere that Intel is
> thinking about doing HTM for some upcoming designs. Why?

Because they don't have any better ideas about how to get all
that legacy software that isn't multithreaded to run in parallel
across the dozens of cores they will soon have. Will it work?
We'll see over the next decade. Azul is already promising they
can do this for Java with their Java big-iron (which has HTM).

David


Dmitriy Vyukov

Aug 24, 2007, 4:54:50 AM
On Aug 24, 11:55 am, David Wragg <da...@wragg.org> wrote:

> Because they don't have any better ideas about how to get all
> that legacy software that isn't multithreaded to run in parallel
> across the dozens of cores they will soon have.

Main question: what about new, multicore-aware software?

Will STM mode be optional? I.e. can I get plain-old-relaxed memory
model for my multicore-aware software?

Dmitriy V'jukov

Marcel Müller

Aug 24, 2007, 5:01:39 AM
Dmitriy Vyukov wrote:

> On Aug 24, 07:51, "Chris Thomasson" <cris...@comcast.net> wrote:
>> WTF are the chip vendors bogging down their chips cache-coherency system
>> designs with transactional memory bullshit!? I read somewhere that Intel is
>> thinking about doing HTM for some upcoming designs. Why?
>
> It seems that relaxed consistency and high-performance distributed
> parallel computing is synonyms.

That's true and unlikely to be overcome.

Einstein's theory of relativity defines events that are separated by more
space than time (in units of c) as necessarily uncorrelated. The conversion
factor is the speed of light, or in electronics about 2/3 of it.
So the time window of a synchronisation point is at least as wide as the
distance between the two locations to be synchronized, expressed in units
of c.

We are pretty close to this limit. At 3GHz the area which might be
synchronized within one clock cycle under ideal conditions is limited to
about 3.5 cm. Since switching times have to be included too, it is even
less in practice.
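
(To sanity-check that figure with the numbers above: the signal speed is
about 2/3 of c, i.e. roughly 2*10^8 m/s, and one clock cycle at 3GHz lasts
1/3 ns. So a signal covers about 2*10^8 m/s * 0.33 ns ~= 6.7 cm per cycle
one way, which leaves about 3.3 cm for the round trip a synchronization
needs.)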

Moving the synchronization from the CPUs to a common unit (e.g. the
memory) reduces the synchronization delays.


Marcel

David Wragg

Aug 24, 2007, 6:14:22 AM
On Aug 24, 9:54 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> Main question: what about new, multicore-aware software?
>
> Will STM mode be optional? I.e. can I get plain-old-relaxed memory
> model for my multicore-aware software?

My guess (and I'm just a software guy who's read a little about
the TM proposals and the early implementations) is that TM will
always be optional in a technical sense. But if it works, or if
enough people are convinced that it will work, there will be
strong pressures on software projects to just rely on TM, and not
to worry about designing software that will scale to many cores
using other techniques.


David

Dmitriy Vyukov

Aug 24, 2007, 7:11:15 AM
On Aug 24, 2:14 pm, David Wragg <da...@wragg.org> wrote:

> My guess (and I'm just a software guy who's read a little about
> the TM proposals and the early implementations) is that TM will
> always be optional in a technical sense.

So, Chris, there is nothing to worry about. You will still be able to
create more scalable and high-performance software ;)

> But if it works, or if
> it enough people are convinced that it will work, there will be
> strong pressures on software projects to just rely on TM, and not
> to worry about designing software that will scale to many cores
> using other techniques.

Oh. It seems that you must hurry, so as not to permit this convenience ;)

Dmitriy V'jukov

Chris Thomasson

Aug 25, 2007, 11:32:25 PM
"David Wragg" <da...@wragg.org> wrote in message
news:1187950462....@r23g2000prd.googlegroups.com...

I am not so sure that TM can reliably scale to many cores without simply
flooding everything with costly coherency traffic...

Chris Thomasson

Aug 25, 2007, 11:33:23 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1187953875.9...@x35g2000prf.googlegroups.com...

> On Aug 24, 2:14 pm, David Wragg <da...@wragg.org> wrote:
>
>> My guess (and I'm just a software guy who's read a little about
>> the TM proposals and the early implementations) is that TM will
>> always be optional in a technical sense.
>
> So, Chris, there is nothing to worry about. You still will be able to
> create more scalable and high-performance software ;)
[...]

> Oh. It seems that you must hurry, to not permit this convenience ;)

:^0

Chris Thomasson

Aug 25, 2007, 11:38:09 PM
"Marcel Müller" <news.5...@spamgourmet.com> wrote in message
news:46ce9ebe$0$30369$9b4e...@newsspool4.arcor-online.net...

> Dmitriy Vyukov wrote:
>> On Aug 24, 07:51, "Chris Thomasson" <cris...@comcast.net> wrote:
>>> WTF are the chip vendors bogging down their chips cache-coherency system
>>> designs with transactional memory bullshit!? I read somewhere that Intel
>>> is
>>> thinking about doing HTM for some upcoming designs. Why?
>>
>> It seems that relaxed consistency and high-performance distributed
>> parallel computing is synonyms.
>
> That's true and unlikely to be overcome.
[...]

> Moving the synchronization from the CPUs to a common unit (e.g. the
> memory) reduces the synchronization delays.

Yup. I want a massively multi-core cpu to be directly integrated onto a bank
of per-cpu "dedicated" underlying memory. Think of a physically large chip
sitting on top of gb(s) of pure "on-board integrated memory":

http://groups.google.com/group/comp.arch/msg/c9d99ae1251f2462

The physical distance between the chip and its dedicated memory bank could
be very, very small indeed!

Chris Thomasson

Aug 25, 2007, 11:39:53 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:TvednVD-V4O6aE3b...@comcast.com...

> "Marcel Müller" <news.5...@spamgourmet.com> wrote in message
> news:46ce9ebe$0$30369$9b4e...@newsspool4.arcor-online.net...
>> Dmitriy Vyukov wrote:
>>> On Aug 24, 07:51, "Chris Thomasson" <cris...@comcast.net> wrote:
>>>> WTF are the chip vendors bogging down their chips cache-coherency
>>>> system
>>>> designs with transactional memory bullshit!? I read somewhere that
>>>> Intel is
>>>> thinking about doing HTM for some upcoming designs. Why?
>>>
>>> It seems that relaxed consistency and high-performance distributed
>>> parallel computing is synonyms.
>>
>> That's true and unlikely to be overcome.
> [...]
>> Moving the synchronization from the CPUs to a common unit (e.g. the
>> memory) reduces the synchronization delays.
>
> Yup. I want a massively multi-core cpu to be directly integrated onto a
> bank of per-cpu "dedicated" underlying memory.

[...]

This would be a "super computer on a chip", so to speak...


Dmitriy Vyukov

Aug 26, 2007, 12:04:25 PM
On Aug 26, 07:39, "Chris Thomasson" <cris...@comcast.net> wrote:

> > Yup. I want a massively multi-core cpu to be directly integrated onto a
> > bank of per-cpu "dedicated" underlying memory.
>
> [...]
>
> This would be a "super computer on a chip", so to speak...

We have this already. This is called NUMA ;)
But NUMA has a more complicated model, and it's harder to program for
it... So this is no longer the plain old SMP... I think there are only a
handful of people who can effectively program for NUMA....
And yes, this seems the only way to scalable systems...

Btw, AMD claims that their multicore chips are a little NUMA. The distance
between all cores and the memory bank is the same, but the interface to
the memory bank is 2-way. The first core has priority of access on the
first way, and the second core has priority of access on the second way.
As I understand it.

Dmitriy V'jukov

David Wragg

Aug 26, 2007, 1:15:03 PM
On Aug 26, 4:32 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> I am not so sure that TM can reliably scale to many cores without simply
> flooding everything with costly coherency traffic...

Or, as core counts increase, legacy software parallelized
using TM might start to hit a wall of livelock, deadlock, or
limited parallelism due to dependencies between transactions.

But if TM is reasonably effective up to some number of cores,
then it will be used by the parts of the software world that have
large amounts of legacy code to support. The question of what
to do when single-core performance hits its limits gets answered
by TM. Perhaps a few years later another question arises: what
to do when TM hits its limits? TM buys a few years of increased
performance with minimal code changes.

Optimists might hope that by the time TM hits its limits, other
parts of the software industry, not so dependent on legacy code
bases, will have pioneered other longer-term approaches to
developing software that exploits large numbers of cores.

David

Dmitriy Vyukov

Aug 26, 2007, 1:59:47 PM
On Aug 26, 21:15, David Wragg <da...@wragg.org> wrote:

> Perhaps a few years later another question arises, what
> to do when TM hits its limits? TM buys a few years of increased
> performance with minimal code changes.

It is unlikely that the hardware guys will implement some feature that
will serve for only a few years... To be implemented, a feature must
serve for at least a few decades!

Dmitriy V'jukov

Message has been deleted

Dmitriy Vyukov

Aug 26, 2007, 4:57:54 PM
On Aug 27, 00:30, Elcaro Nosille <Elcaro.Nosi...@mailinator.com> wrote:
> Chris Thomasson wrote:

>
> > WTF are the chip vendors bogging down their chips cache-coherency
> > system designs with transactional memory bullshit!? I read somewhere
> > that Intel is thinking about doing HTM for some upcoming designs. Why?
>
> It's obviously grievous for you that with support for transactional
> memory, lock-free programming becomes very easy and your knowledge
> about lock-free programming with the usual CAS- or LL/SC-methods
> will be less acknowledged.

And what about high-performance/scalable low-overhead custom
synchronization methods? :)

Dmitriy V'jukov

Message has been deleted

Chris Thomasson

Aug 26, 2007, 9:06:24 PM
"Elcaro Nosille" <Elcaro....@mailinator.com> wrote in message
news:46d20070$0$16111$9b4e...@newsspool1.arcor-online.net...
> Dmitriy Vyukov wrote:

>
>>> It's obviously grievous for you that with support for transactional
>>> memory, lock-free programming becomes very easy and your knowledge
>>> about lock-free programming with the usual CAS- or LL/SC-methods
>>> will be less acknowledged.
>
>> And what about high-performance/scalable low-overhead custom
>> synchronization methods? :)
>
> I've not a problem when my influence on other peoples programming-style
> is low.

So be it.

Chris Thomasson

Aug 26, 2007, 9:12:06 PM
"Elcaro Nosille" <Elcaro....@mailinator.com> wrote in message
news:46d1e270$0$16108$9b4e...@newsspool1.arcor-online.net...
> Chris Thomasson wrote:

>
>> WTF are the chip vendors bogging down their chips cache-coherency
>> system designs with transactional memory bullshit!? I read somewhere
>> that Intel is thinking about doing HTM for some upcoming designs. Why?
>
> It's obviously grievous for you that with support for transactional
> memory, lock-free programming becomes very easy

You have to know what you're doing in order to even think about programming
for a transactional memory system. The way you design things can affect the
level of live-lock you're most likely going to experience during periods of
moderate-to-severe load on the sub-system...

> and your knowledge
> about lock-free programming with the usual CAS- or LL/SC-methods
> will be less acknowledged.

Well, I have been wanting to dust off one of my in-house STM designs for a
while now. Perhaps I could put my skills with interlocked RMW and memory
barrier instructions into building a 'Minimalist STM Sub-System'. Do you
think that would be a good idea?

Chris Thomasson

Aug 26, 2007, 9:16:25 PM
"John Dallman" <j...@cix.co.uk> wrote in message
news:memo.2007082...@jgd.compulink.co.uk...
> In article <1188151187....@w3g2000hsg.googlegroups.com>,

> dvy...@gmail.com (Dmitriy Vyukov) wrote:
>
>> It is unlikely that hardware guys will implement some feature that
>> will serve for only few years... To be implemented feature must serve
>> for at least few decades!
>
> Yes. But how many hardware features have been implemented, and then
> fallen by the wayside? If you're lucky, nobody uses them at all, and you
> can leave them out of the next revision. Do not underestimate the power
> of marketing departments, who think on three-year timescale as an
> absolute maximum, in deciding on new hardware features. That's what got
> us the Pentium 4, for example.

It seems that with multi-core the vendors are desperately in need of
programmers that can create high-performance concurrent software. So, I
guess HTM would be a sort of failsafe for a chip vendor. They don't seem to
realize that their existing instruction sets can be made to scale to very
many cores using low-overhead software-based synchronization algorithms...

Dmitriy Vyukov

Aug 27, 2007, 12:30:40 AM
On Aug 27, 05:16, "Chris Thomasson" <cris...@comcast.net> wrote:

> It seems that with multi-core the vendors are desperately in need of
> programmers that can create high-performance concurrent software. So, I
> guess HTM would be a sort of failsafe for a chip vendor. They don't seem to
> realize that their existing instruction sets can be made to scale to very
> many cores using low-overhead software-based synchronization algorithms...

Hmm. Multi-core chip vendors could include a commercial vZOOM license with
every box :)
It would be like a notebook with a preinstalled OS and other stuff. Did you
suggest this idea to them? :)

Dmitriy V'jukov

Chris Thomasson

Aug 27, 2007, 12:53:54 AM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1188189040.2...@50g2000hsm.googlegroups.com...

Well, I suggested this:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb87f9cb2cb62d84

to the following forum:

http://softwarecommunity.intel.com/isn/Community/en-US/forums/1042/ShowForum.aspx

And got no reply. They deleted my first post, and I can't seem to find my
second post, which survived for about a month it seems:

http://groups.google.com/group/comp.programming.threads/msg/b0efcf84079aa780

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/45eb52c737cf5997

:^0

Chris Thomasson

Aug 27, 2007, 1:01:17 AM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1188189040.2...@50g2000hsm.googlegroups.com...
> On Aug 27, 05:16, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> It seems that with multi-core the vendors are desperately in need of
>> programmers that can create high-performance concurrent software. So, I
>> guess HTM would be a sort of failsafe for a chip vendor. They don't seem
>> to
>> realize that their existing instruction sets can be made to scale to very
>> many cores using low-overhead software-based synchronization
>> algorithms...
>
> Hmm. Multi-core chip vendors can include commercial vZOOM license with
> every box :)

[...]

That would be a fairy tale, no?

:^(...


Chris Thomasson

Aug 27, 2007, 1:07:25 AM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:742dnR145fj4xU_b...@comcast.com...

> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
> news:1188189040.2...@50g2000hsm.googlegroups.com...
>> On Aug 27, 05:16, "Chris Thomasson" <cris...@comcast.net> wrote:
>>
>>> It seems that with multi-core the vendors are desperately in need of
>>> programmers that can create high-performance concurrent software. So, I
>>> guess HTM would be a sort of failsafe for a chip vendor. They don't seem
>>> to
>>> realize that their existing instruction sets can be made to scale to
>>> very
>>> many cores using low-overhead software-based synchronization
>>> algorithms...
>>
>> Hmm. Multi-core chip vendors can include commercial vZOOM license with
>> every box :)
>> It would be like notebook with preset OS and other stuff. Do you
>> suggest this idea to them? :)
>
> Well, I suggested this:
>
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb87f9cb2cb62d84
>
> to the following forum:
>
> http://softwarecommunity.intel.com/isn/Community/en-US/forums/1042/ShowForum.aspx
>
> And have no reply. They deleted my first post, and I can't seen to find my
> second post which survived for about a month it seems:

[...]

http://groups.google.com/group/comp.programming.threads/msg/fd1c7c4ce94778c5

DAMN!

merri...@gmail.com

Aug 27, 2007, 2:01:33 AM
On Aug 26, 6:16 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
> "John Dallman" <j...@cix.co.uk> wrote in message
>
> news:memo.2007082...@jgd.compulink.co.uk...
>
> > In article <1188151187.525476.65...@w3g2000hsg.googlegroups.com>,

> > dvyu...@gmail.com (Dmitriy Vyukov) wrote:
>
> >> It is unlikely that hardware guys will implement some feature that
> >> will serve for only few years... To be implemented feature must serve
> >> for at least few decades!
>
> > Yes. But how many hardware features have been implemented, and then
> > fallen by the wayside? If you're lucky, nobody uses them at all, and you
> > can leave them out of the next revision. Do not underestimate the power
> > of marketing departments, who think on three-year timescale as an
> > absolute maximum, in deciding on new hardware features. That's what got
> > us the Pentium 4, for example.
>
> It seems that with multi-core the vendors are desperately in need of
> programmers that can create high-performance concurrent software. So, I
> guess HTM would be a sort of failsafe for a chip vendor. They don't seem to
> realize that their existing instruction sets can be made to scale to very
> many cores using low-overhead software-based synchronization algorithms...

I think that you may be able to find ways to use HTM to make your low-
overhead software-based synchronization algorithms perform better than
they would otherwise. Think of HTM as a hint to the processor that a
sequence of dependent reads and writes is about to take place that it
should attempt to minimize contention with other processors. At the
lowest level I only need a couple of queue primitives that are
synchronized across processors. I don't really care how complex the
implementation ends up being; if HTM can make them simpler, faster, or
more reliable than a lockless method, then it is a win.
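
To make that shape concrete, here is a rough sketch in C of a critical
section that attempts a hardware transaction first and falls back to a
lock. The _xbegin/_xend/_xabort intrinsics are from Intel's RTM extension,
which shipped years after this discussion; they only stand in here for
whatever begin/commit interface an HTM might actually expose, and the
naive spinlock and all the names are illustrative assumptions:

#include <immintrin.h>   /* _xbegin/_xend/_xabort (Intel RTM) */
#include <stdatomic.h>

static atomic_int fallback_lock;   /* 0 = free, 1 = held */

static void critical_section(void (*work)(void *), void *arg)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        /* Read the lock inside the transaction: if a fallback
           thread owns or later takes it, we abort instead of
           racing the lock holder. */
        if (atomic_load_explicit(&fallback_lock, memory_order_relaxed))
            _xabort(0xff);
        work(arg);        /* plain loads/stores, rolled back on abort */
        _xend();          /* commit */
        return;
    }
    /* Aborted (conflict, capacity, ...): take the real lock. */
    while (atomic_exchange_explicit(&fallback_lock, 1, memory_order_acquire))
        ;                 /* spin */
    work(arg);
    atomic_store_explicit(&fallback_lock, 0, memory_order_release);
}

The begin/commit pair plus the implicit retry/fallback structure is
exactly where the cost and live-lock worries raised elsewhere in this
thread come in.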

Chris Thomasson

Aug 27, 2007, 2:37:28 AM
<merri...@gmail.com> wrote in message
news:1188194493.2...@z24g2000prh.googlegroups.com...

I can't quite see how vZOOM can benefit from HTM. The library tries to avoid
using interlocked RMW and/or memory barrier operations. A call into an HTM
instruction-set would, IMO, be an expensive call... BTW, there would have
to be two calls: a begin, and a subsequent commit. If you're dealing with a
scenario that has millions of read transactions and thousands of write
transactions on a sort-of persistent basis, then I want to avoid calling
interlocked anything, let alone HTM. Please note that this is not because
of a bias toward not using HTM. I want to avoid calling all expensive
operations, which an HTM instruction-set will most likely be a part of.
Well, when compared to the basic instructions rmw instructions at lease.
BTW, how are you going to design an HTM that has low-overhead
cache-coherency? You can coarsen the grain for TM to increase atomic-block
performance, but then you increase the overall live-lock!

Chris Thomasson

Aug 27, 2007, 2:41:42 AM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:B82dnTULj-s07U_b...@comcast.com...
[...]

> I can't quite see how vZOOM can benefit from HTM. The library tries to
> avoid using interlocked rmw and/or memory barrier operations. A call to a
> HTM instruction-set would , IMO, be an expensive call... BTW, there would
> have to be two calls. A begin, and a subsequent commit.
[...]

That is two calls in a loop that is not guaranteed to be committed on the
first try.... Please keep in mind that the loop comprises a full
atomic-block. This block may be iterating over large linked data-structures.
The depth of the linked collections is directly related to the level of
live-lock you will experience.

vZOOM can allow multiple reader threads to iterate throughout multiple
unrelated linked-data-structures _concurrently with multiple writer threads
making various modifications_, without _any_ livelock...

Chris Thomasson

Aug 27, 2007, 2:47:44 AM
[...]

> Please note that this is not because of a bias toward not
> using HTM. I want to avoid calling all expensive operations, which HTM
> instruction-set will most likely be a part of.


> Well, when compared to the
> basic instructions rmw instructions at lease.

^^^^^^^^^^^^^
[...]

That sentence needs to read as:

'Well, when compared to basic interlocked rmw instruction at least.'

Piotr Wyderski

Aug 27, 2007, 9:21:25 AM
Chris Thomasson wrote:

> It seems that with multi-core the vendors are desperately in need of
> programmers that can create high-performance concurrent software. So, I
> guess HTM would be a sort of failsafe for a chip vendor. They don't seem
> to realize that their existing instruction sets can be made to scale to
> very many cores using low-overhead software-based synchronization
> algorithms...

What "existing instruction sets" are you writing about?
That of Sparc, which doesn't include wide CAS in
64-bit mode? Or perhaps you mean IA-32 with its
lock-based instructions which are so slow that barely
usable? Been there, done that -- lock-free algorithms
are a great alternative for many lock-based solutions,
but even then their performance is simply a disaster.
Could the chip vendors give us an extremely fast
point-to-point communication framework which can
send a message from procesor A directly to processor
B instead of HTM, STM or similar insanities? There
is such a bus, based on the local APIC units and used
for IPIs, but it's way too slow and too primitive. It
is perfectly doable, for instance the Virtex 4 FPGA
chip has many multi-GHz differential lanes to provide
amazingly fast burst transfers. Why can't Core2
have 4 more pins (2 for diff in, 2 for diff out) and
pump its IPC data into the next CPU's receive buffer
at 2GHz and simultaneously receive something from
the previous CPU? A simple token ring topology
would be enough for small systems and a P2P
router chip can be added for the larger ones.

Best regards
Piotr Wyderski

Dmitriy Vyukov

Aug 27, 2007, 10:07:45 AM
On Aug 27, 5:21 pm, "Piotr Wyderski" <wyder...@mothers.against.spam-
ii.uni.wroc.pl> wrote:

> Been there, done that -- lock-free algorithms
> are a great alternative for many lock-based solutions,
> but even then their performance is simply a disaster.

Aren't you by any chance talking about those bus-saturating
5-CAS-per-operation algorithms by Michael, Valois, etc.? :)))

Dmitriy V'jukov

Piotr Wyderski

Aug 27, 2007, 10:03:36 AM
Dmitriy Vyukov wrote:

> And what about high-performance/scalable low-overhead custom
> synchronization methods? :)

"Low overhead" and "synchronization" in the same sentence?
There is no such beast. You may call it "lower overhead" and
you will be perfectly correct, but it is far from being "low".

Best regards
Piotr Wyderski


Piotr Wyderski

Aug 27, 2007, 10:15:17 AM
Dmitriy Vyukov wrote:

> Accidentally don't you talking about those bus-saturating 5-CAS/
> operation algorithm by Michael, Valois etc? :)))

No, about a single CAS/operation concurrent FIFO.

Best regards
Piotr Wyderski

Dmitriy Vyukov

Aug 27, 2007, 12:20:51 PM
On Aug 27, 6:03 pm, "Piotr Wyderski" <wyder...@mothers.against.spam-ii.uni.wroc.pl> wrote:

What about lock-free reader pattern?
No interlocked rmw or expensive memory barriers at all on reader side.

Dmitriy V'jukov

Dmitriy Vyukov

Aug 27, 2007, 12:27:13 PM
On Aug 27, 6:15 pm, "Piotr Wyderski" <wyder...@mothers.against.spam-ii.uni.wroc.pl> wrote:
> Dmitriy Vyukov wrote:
> > Accidentally don't you talking about those bus-saturating 5-CAS/
> > operation algorithm by Michael, Valois etc? :)))
>
> No, about a single CAS/operation concurrent FIFO.

What type of FIFO queue are you talking about?
A multiple-producer/multiple-consumer queue must use more CASes per
operation.
What type of FIFO queue can get by with a single CAS per operation?

Dmitriy V'jukov

Piotr Wyderski

Aug 27, 2007, 1:32:34 PM
Dmitriy Vyukov wrote:

> What about lock-free reader pattern?

Hardly ever applicable.

Best regards
Piotr Wyderski

Dmitriy Vyukov

Aug 27, 2007, 2:08:44 PM
On Aug 27, 9:32 pm, "Piotr Wyderski" <wyder...@mothers.against.spam-ii.uni.wroc.pl> wrote:
> Dmitriy Vyukov wrote:
> > What about lock-free reader pattern?
>
> Hardly ever applicable.

Why exactly? Please, explain in detail.

Dmitriy V'jukov

Chris Thomasson

Aug 27, 2007, 2:21:29 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fauj4e$725$1...@news.dialog.net.pl...

> Chris Thomasson wrote:
>
>> It seems that with multi-core the vendors are desperately in need of
>> programmers that can create high-performance concurrent software. So, I
>> guess HTM would be a sort of failsafe for a chip vendor. They don't seem
>> to realize that their existing instruction sets can be made to scale to
>> very many cores using low-overhead software-based synchronization
>> algorithms...
>
> What "existing instruction sets" are you writing about?
> That of Sparc, which doesn't include wide CAS in
> 64-bit mode? Or perhaps you mean IA-32 with its
> lock-based instructions which are so slow that barely
> usable?

The current SPARC and IA-32/64 instruction sets work fine. You don't really
need DWCAS. As for the lock prefix on x86, well, you can design around
using it. Have you seen my memory allocator algorithm that does this? I
have posted it to this group a couple of times.


> Been there, done that -- lock-free algorithms
> are a great alternative for many lock-based solutions,
> but even then their performance is simply a disaster.

You mean the performance of poorly designed lock-free algorithms is a
disaster, right?


> Could the chip vendors give us an extremely fast
> point-to-point communication framework which can
> send a message from procesor A directly to processor
> B instead of HTM, STM or similar insanities?

Sure, but message-passing algorithms are really not that much help for
certain synchronization needs. For instance, when your reader threads need
to iterate through large linked collections, message-passing is not going
to help much...

[...]


Chris Thomasson

Aug 27, 2007, 2:26:33 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fauljf$8mt$1...@news.dialog.net.pl...

I disagree. Look here for a simple example:

http://groups.google.com/group/comp.arch/browse_frm/thread/24c40d42a04ee855

The algorithm I invented can implement a multi-threaded allocator that can
scale to very many cores and has virtually zero overhead. How can you say
that the overhead is far from being low? Here is a lock-free FIFO queue that
uses no interlocked RMW or store/load memory barriers:

http://groups.google.com/group/comp.lang.c++.moderated/msg/9bdc5f50be08d7d6

How can you say that such algorithms are far from being low-overhead? BTW,
that queue can be multiplexed to provide high-speed message-passing between
multiple CPUs. I have more examples if you want.

Chris Thomasson

Aug 27, 2007, 2:28:07 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fav1ra$gjo$1...@news.dialog.net.pl...

> Dmitriy Vyukov wrote:
>
>> What about lock-free reader pattern?
>
> Hardly ever applicable.

You think most applications don't have any "read-mostly" access patterns
into shared data-structures? Did you know that lock-free reader patterns
can usually be made to replace read-write locks?

Chris Thomasson

Aug 27, 2007, 2:29:27 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:faum9b$930$1...@news.dialog.net.pl...

> Dmitriy Vyukov wrote:
>
>> Accidentally don't you talking about those bus-saturating 5-CAS/
>> operation algorithm by Michael, Valois etc? :)))
>
> No, about a single CAS/operation concurrent FIFO.

I would be interested in looking at the algorithm. However, I can multiplex
single-producer/consumer queues in a way that can beat CAS-based FIFO...

Chris Thomasson

Aug 27, 2007, 2:45:01 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1188232033....@i38g2000prf.googlegroups.com...

I guess you could use the reversed LIFO stack trick:

http://groups.google.ca/group/comp.programming.threads/msg/a359bfb41c68b98b

However, that is a multiple-producer/single-consumer. Although, the depth of
the queue is proportional to the overhead of the reversal...

Chris Thomasson

Aug 27, 2007, 2:46:16 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fav1ra$gjo$1...@news.dialog.net.pl...

> Dmitriy Vyukov wrote:
>
>> What about lock-free reader pattern?
>
> Hardly ever applicable.

Tell that to Mckenney over on the Linux Kernel Mailing List!

;^)

Chris Thomasson

Aug 27, 2007, 2:57:30 PM
Yikes! I need to make a correction:

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:3OGdnXm6gKyzhk7b...@comcast.com...
[...]


> I guess you could use the reversed LIFO stack trick:
>
> http://groups.google.ca/group/comp.programming.threads/msg/a359bfb41c68b98b
>
> However, that is a multiple-producer/single-consumer. Although, the depth
> of the queue is proportional to the overhead of the reversal...

The reversed LIFO stack trick _is_ multi-producer/consumer. The _caveat_
is that a consumer cannot remove a single node from the queue; it's all or
nothing.
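
For reference, the trick looks roughly like this in C11 atomics (the names
are mine and this is only a sketch of the idea, not the code behind the
links above):

#include <stdatomic.h>
#include <stddef.h>

struct node { struct node *next; /* ... payload ... */ };

static _Atomic(struct node *) stack;   /* shared LIFO head */

/* any number of producers: one CAS per push */
void push(struct node *n)
{
    struct node *head = atomic_load(&stack);
    do {
        n->next = head;
    } while (!atomic_compare_exchange_weak(&stack, &head, n));
}

/* consumer: one XCHG grabs the WHOLE stack (all or nothing),
   then a private reversal restores FIFO order */
struct node *pop_all_fifo(void)
{
    struct node *lifo = atomic_exchange(&stack, NULL);
    struct node *fifo = NULL;
    while (lifo) {                     /* O(depth) reversal */
        struct node *next = lifo->next;
        lifo->next = fifo;
        fifo = lifo;
        lifo = next;
    }
    return fifo;
}

The reversal loop is the part that makes the overhead proportional to the
depth of the queue.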

Dmitriy Vyukov

Aug 27, 2007, 4:27:38 PM
On Aug 27, 22:57, "Chris Thomasson" <cris...@comcast.net> wrote:

> > I guess you could use the reversed LIFO stack trick:

It's not one CAS per operation. It's more like 0.5 CAS per operation: one
CAS for each push, and a single XCHG drains the whole queue for all of the
pops.

Dmitriy V'jukov

Chris Thomasson

Aug 27, 2007, 5:30:29 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1188144265.0...@o80g2000hse.googlegroups.com...
> On Aug 26, 07:39, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> > Yup. I want a massively multi-core cpu to be directly integrated onto a
>> > bank of per-cpu "dedicated" underlying memory.
>>
>> [...]
>>
>> This would be a "super computer on a chip", so to speak...
>
> We have this already. This is called NUMA ;)

Well, I want to see the day when there are hundreds of cores per chip...

:^)


> But NUMA have more complicated model, and it's harder to program for
> it... So this is no more the plain-old-SMP... I think that there is a
> handful of people which can effectively program for NUMA....
> And yes, this seems the only way to scalable systems...

Yeah. It's basically all about binding the actions of a CPU to its local
memory bank. Also, you kind of have to use algorithms, such as
RCU/RCU-SMR/vZOOM, which can run under relaxed coherency schemes... Do we
really need ccNUMA? Na...

> Btw, AMD claims that their multicore chips is NUMA a little. There are
> the same distance between all cores and memory bank, but interface to
> memory bank is 2-way. And first core have priority in access on first
> way, and second core have priority in access on second way. As I
> understand.

Interesting. I think they are doing something that involves sticking
multiple CPUs, GPUs and their various dedicated co-processors on a single
chip.

Chris Thomasson

Aug 27, 2007, 5:42:22 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1188246458.7...@y42g2000hsy.googlegroups.com...

:^)

ken...@cix.compulink.co.uk

Aug 28, 2007, 5:00:27 PM
In article <fauj4e$725$1...@news.dialog.net.pl>,
wyde...@mothers.against.spam-ii.uni.wroc.pl (Piotr Wyderski) wrote:

> Could the chip vendors give us an extremely fast
> point-to-point communication framework which can
> send a message from procesor A directly to processor
> B

It depends; I seem to remember that the transputer used a fast serial
interface to connect processors. It was fast for the time, but it limited
the number of processors that could directly talk to each other.

Ken Young

Chris Thomasson

Aug 28, 2007, 7:58:10 PM
<ken...@cix.compulink.co.uk> wrote in message
news:g4OdnUQWOal2EUnb...@pipex.net...

DMA:

http://groups.google.com/group/comp.arch/msg/198a048b98fabe45

You can do message-passing with those instructions.

Chris Thomasson

Aug 29, 2007, 2:03:27 AM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:Qqudnc6VF470uU_b...@comcast.com...
> "Elcaro Nosille" <Elcaro....@mailinator.com> wrote in message
> news:46d1e270$0$16108$9b4e...@newsspool1.arcor-online.net...
>> Chris Thomasson wrote:
>>
>>> WTF are the chip vendors bogging down their chips cache-coherency
>>> system designs with transactional memory bullshit!? I read somewhere
>>> that Intel is thinking about doing HTM for some upcoming designs. Why?
>>
>> It's obviously grievous for you that with support for transactional
>> memory, lock-free programming becomes very easy
>
> You have to know what your doing in order to even think about programming
> for a transactional memory system. The way you design things can effect
> the level of live-lock your most-likely going to experience during periods
> of moderate-to-severe load on the sub-system...
[...]

http://blogs.sun.com/dave/resource/cgo2007-TL1-STM-camera.pdf

Zombie Transactions... OUCH!

:^0

Chris Thomasson

Aug 29, 2007, 2:04:23 AM
[...]
> http://blogs.sun.com/dave/resource/cgo2007-TL1-STM-camera.pdf
[...]

Well, at least that paper mentions some of the caveats.

Chris Thomasson

Aug 29, 2007, 3:16:46 AM
[...]
> http://blogs.sun.com/dave/resource/cgo2007-TL1-STM-camera.pdf
[...]

They also seem to finally realize that lock-based TM is far superior to
lock-free TM.

Dmitriy Vyukov

Aug 29, 2007, 5:34:17 AM
On Aug 29, 10:04, "Chris Thomasson" <cris...@comcast.net> wrote:

> Well, at least that paper mentions some of the caveats.

Suppose I have some heavy operation, and I don't want to put that operation
into a transaction, because I don't want the operation to retry several
times (the operation is heavy). Then I must back off from TM to locking.
So I revert to the old habits and to the old problems!
I think such situations (heavy operations) will stay in programs, so TM
doesn't simplify things; TM will complicate things even more, because I
will have to use both TM and locking, cope with both, prevent deadlocks,
and watch that they don't mix badly.

Or have I misunderstood something? Can someone elucidate?

Dmitriy V'jukov

Piotr Wyderski

Aug 29, 2007, 2:58:00 PM
Chris Thomasson wrote:

> Current SPARC and IA-32/64 instruction set works fine.
> You don't really need DWCAS.

What do you mean by DWCAS? cmpxchg8b in 32-bit
mode or a true CAS on two unrelated memory locations?
If you mean the first option, then I really do know the
tricks, but they are pretty ugly, e.g. a separate 4GiB
memory area reserved solely for the lock-free algorithms.

> As for the lock prefix on x86, well, you can design around using them.

It doesn't matter whether you use the lock prefix or simulate
it somehow (for example, using the load-to-store forwarding
bypass trick based on mixed-size memory operands). The
problem is with the cache coherence protocol, which has
tremendous overheads.

> Have you seen my memory allocator algorithm that does this?

No, I have not, but I have also designed a blindingly-fast parallel
allocator for the industry. The key to high performance is to
do NUMA-style processing on a UMA machine, that is,
separate the processing units as much as possible.

> You mean performance for poorly designed lock-free algorithms are a
> disaster right?

No, I mean that anything that requires strong coherence
between different processing units is a performance disaster
whatever technique you choose. It can be a lock-free
algorithm based on CAS, it can be a mutex-based approach
-- it simply will be slow. The lock-free way will scale up
much better, indeed, but even then the performance will
be poor compared to a piece of code executed on a
uniprocessor machine. "much better" doesn't mean "good".

> Sure, but message-passing algorithms are really not that much help for
> certain synchronization needs.

Surely, but they can fulfil many other needs perfectly.

Best regards
Piotr Wyderski

Chris Thomasson

Aug 29, 2007, 3:20:15 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fb4fj8$s19$1...@news.dialog.net.pl...

> Chris Thomasson wrote:
>
>> Current SPARC and IA-32/64 instruction set works fine.
>> You don't really need DWCAS.
>
> What do you mean by DWCAS? cmpxchg8b in 32-bit
> mode or a true CAS on two unrelated memory locations?
> If you mean the first option, then I really do know the
> tricks, but they are pretty ugly, e.g. a separate 4GiB
> memory area reserved solely for the lock-free algorithms.

You do not have to use a reserved memory address-span. You can use the
offset trick, or simply not rely on algorithms that depend on DWCAS.
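
Roughly, one form of the offset trick looks like this in C11 (the 16/16
split and every name here are illustrative assumptions only): keep the
nodes in a pool, and pack a pool index together with an ABA counter into
one word, so a plain single-word CAS covers the job DWCAS is usually
wanted for:

#include <stdatomic.h>
#include <stdint.h>

#define NIL 0xFFFFu                      /* "null" index */

struct node { uint16_t next; /* ... payload ... */ };
static struct node pool[0xFFFF];         /* nodes addressed by index */
static _Atomic uint32_t top = NIL;       /* [ aba:16 | index:16 ] */

/* lock-free LIFO push by pool index; the counter half is bumped on
   every successful CAS so a later pop can detect reuse (ABA) */
void push(uint16_t i)
{
    uint32_t old = atomic_load(&top);
    uint32_t neu;
    do {
        pool[i].next = (uint16_t)(old & 0xFFFFu);   /* link to old top */
        neu = ((old & 0xFFFF0000u) + 0x10000u) | i; /* bump aba, set index */
    } while (!atomic_compare_exchange_weak(&top, &old, neu));
}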


>> As for the lock prefix on x86, well, you can design around using them.
>
> It doesn't matter whether you use the lock prefix or simulate
> it somehow (for example, using the load-to-store forwarding
> bypass trick based on mixed-size memory operands). The
> problem is with the cache cocherence protocol, which has
> tremendous overheads.

I do not simulate the lock prefix in any way shape or form. However, I agree
with you that cache-coherency overhead is a real issue. I personally want
very relaxed models: I do not see any "real need" for super-strong coherency
schemes.

>> Have you seen my memory allocator algorithm that does this?
>
> No, I have not, but I have also designed a blind-fast parallel
> allocator for the industry. The key to high performance is to
> do NUMA-style processing on a UMA machine, that is,
> separate the processing units as much as possible.

Here is my invention:

----------
http://groups.google.com/group/comp.arch/browse_frm/thread/24c40d42a04ee855

The algorithm I invented can implement a multi-threaded allocator that can
scale to very many cores and has virtually zero overhead. How can you say
that the overhead is far from being low? Here is a lock-free FIFO queue that
uses no interlocked RMW or store/load memory barriers:

http://groups.google.com/group/comp.lang.c++.moderated/msg/9bdc5f50be08d7d6

How can you say that such algorithms are far from being low-overhead? BTW,
that queue can be multiplexed to provide high-speed message-passing between
multiple CPUs. I have more examples if you want.
----------


>> You mean performance for poorly designed lock-free algorithms are a
>> disaster right?
>
> No, I mean that anything that requires strong coherence
> between different processing units is a performance disaster
> whatever technique you choose.

vZOOM, RCU, RCU+SMR, and other such algorithms do not depend on strong
cache coherency. In fact, they don't put any requirements on
cache-coherency at all. They can rely on weak models indeed...


> It can be a lock-free
> algorithm based on CAS, it can be a mutex-based approach
> -- it simply will be slow. The lock-free way will scale up
> much better, indeed, but even then the performance will
> be poor compared to a piece of code executed on a
> uniprocessor machine. "much better" doesn't mean "good".
>
>> Sure, but message-passing algorithms are really not that much help for
>> certain synchronization needs.
>
> Surely, but they can fullfil many other needs perfectly.

Yup. It depends on what your needs are exactly...

;^)

Dmitriy Vyukov

Aug 29, 2007, 3:19:43 PM
On Aug 29, 10:58 pm, "Piotr Wyderski" <wyder...@mothers.against.spam-ii.uni.wroc.pl> wrote:

> No, I have not, but I have also designed a blind-fast parallel
> allocator for the industry. The key to high performance is to
> do NUMA-style processing on a UMA machine, that is,
> separate the processing units as much as possible.

How do you solve the problem when an application has a steady pattern
which causes memory to be allocated in one thread and freed in another?
For example, the producer-consumer pattern.
Dmitriy V'jukov

Piotr Wyderski

Aug 29, 2007, 3:11:09 PM
Chris Thomasson wrote:

>> No, about a single CAS/operation concurrent FIFO.
>
> I would be interested in looking at the algorithm.

I bet you would. :-]

> However, I can multiplex single-producer/consumer
> queues in a way that can beat CAS-based FIFO...

But you cannot implement an extremely fine-grained
multi-producer/multi-consumer FIFO which is free
of the coherence overhead. With 8+ CPUs spinning
on such a queue your bus simply goes crazy. One
must use different approaches + a lot of cheating to
achieve similar results.

Best regards
Piotr Wyderski

Chris Thomasson

Aug 29, 2007, 3:30:37 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fb4gbu$sjr$1...@news.dialog.net.pl...

Well, here is the queue:

http://groups.google.com/group/comp.lang.c++.moderated/msg/9bdc5f50be08d7d6


I don't spin on the queue. I use an eventcount algorithm I invented:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/aa8c62ad06dbb380


This gives high-performance condvar-like functionality to many different
types of lock-free algorithms.
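
For readers who have never seen one, the semantics of an eventcount can be
written down naively with a mutex and a condvar, as below. This sketch only
shows the interface contract; the whole point of the algorithm behind the
link is to make the common paths lock-free, which this does not attempt:

#include <pthread.h>

struct eventcount {
    pthread_mutex_t mtx;
    pthread_cond_t  cnd;
    unsigned        count;
};

/* snapshot the counter; the caller re-checks its lock-free
   condition (e.g. "queue non-empty") after taking the snapshot */
unsigned ec_get(struct eventcount *ec)
{
    pthread_mutex_lock(&ec->mtx);
    unsigned c = ec->count;
    pthread_mutex_unlock(&ec->mtx);
    return c;
}

/* block until somebody signals after our snapshot was taken */
void ec_wait(struct eventcount *ec, unsigned key)
{
    pthread_mutex_lock(&ec->mtx);
    while (ec->count == key)
        pthread_cond_wait(&ec->cnd, &ec->mtx);
    pthread_mutex_unlock(&ec->mtx);
}

/* producers call this after making new work visible */
void ec_signal(struct eventcount *ec)
{
    pthread_mutex_lock(&ec->mtx);
    ec->count++;
    pthread_cond_broadcast(&ec->cnd);
    pthread_mutex_unlock(&ec->mtx);
}

A consumer does key = ec_get(ec), re-checks the queue, and only if it is
still empty calls ec_wait(ec, key); that is what lets it sleep instead of
spinning on the lock-free queue.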


Piotr Wyderski

Aug 29, 2007, 4:19:04 PM
Dmitriy Vyukov wrote:

> How do you solve the problem when application have steady pattern
> which causes memory to be allocated in one thread and freed in
> another? For example producer-consumer pattern.

I detect this situation and give the block back to the producer
using a lock-free "slow path". The producer's allocator does not touch
those blocks if not absolutely necessary, and does it in batches to spread
the cost of cache coherence.

Best regards
Piotr Wyderski

Dmitriy Vyukov

Aug 29, 2007, 4:27:08 PM
On Aug 30, 12:19 am, "Piotr Wyderski" <wyder...@mothers.against.spam-ii.uni.wroc.pl> wrote:

It looks very similar to Chris Thomasson's invention (judging by this brief
description).
I suppose that you use a "reversed" multiple-producer/single-consumer FIFO
queue based on a LIFO stack to pass blocks back to the "producer" thread.
And the "producer" grabs the whole queue with one XCHG ;)

Dmitriy V'jukov

Chris Thomasson

Aug 29, 2007, 4:29:55 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fb4kb9$v8s$1...@news.dialog.net.pl...

That's how my invention does it. BTW, you should contact:

http://groups.google.com/group/comp.programming.threads/msg/0156c6caa9bd7f69

I have had a lengthy discussion with the professor and his PhD student. I
can prove that I invented my algorithm in 2005. I use it in a commercial
product. BTW, why didn't you comment on my post!:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/7eb03aabbe31c855

Do you have a patent application on it?

Chris Thomasson

Aug 29, 2007, 4:31:42 PM
[...]

> I have had lengthy discussion with the professor and his PHD student. I
> can prove that I invented my algorithm in 2005. I use it in a commercial
> product. BTW why didn't you comment on my post!:
[...]

God damn typo!

I can prove I invented the allocator, in-house, in 2004. I posted the algo
in comp.programming.threads in 2005.

Piotr Wyderski

Aug 29, 2007, 4:48:29 PM
Chris Thomasson wrote:

> You do not have to use a reserved memory address-span.
> You can use offset trick

Or you can use indirect addressing based on an "index to address"
translation LUT and many other things. But it is ugly. Not as ugly
as SMR or LFRC, but ugly enough not to use it when it is not critical.

> However, I agree with you that cache-coherency overhead is a real issue.

It limits the capacity of your inter-CPU communication channel to a mere
ten million FIFO transactions per second (when you are lucky). I can get
hundreds of millions when "cheating", that is, when the processor serves
its own requests, so it is not a true FIFO queue. My tasks are
fine-grained, so it makes a big difference.

> I personally want very relaxed models: I do not see any
> "real need" for super-strong coherency schemes.

It doesn't help in many situations. The relaxed consistency models
are useful when you pass simple objects, but often a pointer to
a very complex object is being passed through the queue. Then
you must use a full-blown pair of acquire and release barriers no
matter what -- the view of the content of the received object
_must_ be consistent.
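
In C11-atomics terms (which postdate this thread, so this is merely a
restatement of the point with assumed names), passing a pointer to a
filled-in object needs exactly that pairing:

#include <stdatomic.h>
#include <stddef.h>

struct msg { int a, b; };

static _Atomic(struct msg *) mailbox;

/* producer: fill in the object, THEN publish the pointer */
void send(struct msg *m)
{
    m->a = 1;
    m->b = 2;
    atomic_store_explicit(&mailbox, m, memory_order_release);
}

/* consumer: the acquire pairs with the release above, so if a
   non-NULL pointer is seen, the fields are guaranteed visible */
struct msg *try_recv(void)
{
    return atomic_load_explicit(&mailbox, memory_order_acquire);
}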

> How can you say that the overhead is far from being low?

Perhaps that is because we use different bases for comparison.
In my case it is the performance of an uniprocessor machine,
yours seems to be the lock-based approach.

> Here is a lock-free FIFO queue that
> uses no interlocked rmw or store/load memory barriers:

Does it guarantee a consistent view of an object passed by pointer?
It does not, so it doesn't matter whether or not the FIFO itself
requires a barrier. The contents require it.

> How can you say that such algorithms are far from being low-overhead?

It's quite easy, I can teach you if you wish. :-)

Best regards
Piotr Wyderski

Chris Thomasson

Aug 29, 2007, 4:58:39 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fb4m2e$iq$1...@news.dialog.net.pl...

> Chris Thomasson wrote:
>
>> You do not have to use a reserved memory address-span.
>> You can use offset trick
[...]

>> Here is a lock-free FIFO queue that
>> uses no interlocked rmw or store/load memory barriers:
>
> Does it guarantee consistent view of an object passed by pointer?
> It does not, so it doesn't matter whether or not the FIFO itself
> requires a barrier. The contents require it.
[...]

Yes it does; please note that current x86 has a #LoadStore | #StoreStore
barrier implied for stores, and #LoadLoad | #LoadStore implied for loads.
Please look at the code:

-----------

align 16
ac_i686_queue_spsc_push PROC
; single-producer push (assumed layout: queue = { head @ +0, tail @ +4 },
; node = { next @ +0, ..., value @ +12 })
mov eax, [esp + 4] ; eax = queue
mov ecx, [esp + 8] ; ecx = new node
mov edx, [eax + 4] ; edx = queue->tail
; sfence may be needed on future x86
mov [edx], ecx ; tail->next = node
mov [eax + 4], ecx ; queue->tail = node
ret
ac_i686_queue_spsc_push ENDP


align 16
ac_i686_queue_spsc_pop PROC
push ebx
mov ecx, [esp + 8] ; ecx = queue (arg shifted by the push ebx)
mov eax, [ecx] ; eax = queue->head (current dummy node)
cmp eax, [ecx + 4] ; head == tail means the queue is empty
je ac_i686_queue_spsc_pop_failed
mov edx, [eax] ; edx = head->next
; lfence may be needed on future x86
mov ebx, [edx + 12] ; grab the next node's value
mov [ecx], edx ; queue->head = next (next becomes the new dummy)
mov [eax + 12], ebx ; move the value into the returned node
pop ebx
ret ; eax = dequeued node

ac_i686_queue_spsc_pop_failed:
xor eax, eax ; return NULL: queue empty
pop ebx
ret
ac_i686_queue_spsc_pop ENDP

-----------

Notice how I say/comment that the sfence/lfence instructions might be
needed on future Intel CPUs? If you want this on SPARC in non-TSO mode,
well, you need a #StoreStore. You don't need full acquire/release;
#StoreLoad is not required here. If you're on Alpha, then you need a membar
for the producer AND a membar for the consumer.


>> How can you say that such algorithms are far from being low-overhead?
>
> It's quite easy, I can teach you if you wish. :-)

How is my allocator invention, eventcount, or my queue not low-overhead?

Piotr Wyderski

Aug 29, 2007, 4:55:57 PM
Dmitriy Vyukov wrote:

> I can suppose that you use "reversed" multiple-producer/single-
> consumer fifo queue based on lifo-stack to pass blocks to "producer"
> thread. And "producer" grabs all queue by one XCHG ;)

Precisely, I use this ancient trick among others. :-)

Best regards
Piotr Wyderski

Chris Thomasson

Aug 29, 2007, 5:04:03 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fb4mgd$106$1...@news.dialog.net.pl...

Why didn't you mention this earlier? Your 1-CAS-per-operation FIFO can be
improved.

Chris Thomasson

Aug 29, 2007, 5:07:53 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fb4mgd$106$1...@news.dialog.net.pl...

I suppose that you use the same reference-counting magic I created for my
allocator? It's a fairly neat technique, IMO at least.

Piotr Wyderski

Aug 29, 2007, 5:18:42 PM
Chris Thomasson wrote:

> That how my invention does it.

I don't know your invention, so can't compare our approaches.
However, this way is pretty natural, so I am not surprised that
there are other users of this trick.

> BTW, you should contact:

Contact whom and for what purpose?

> I have had lengthy discussion with the professor and his PHD
> student. I can prove that I invented my algorithm in 2005.

Good for you, but what's going on? I don't claim that you
are not the author of your algorithm; I don't even know it.
I will read your documentation when I find a moment of
spare time, no problem. But I will not discuss the
differences in our designs in public; these things are too
valuable. You know my e-mail address, Chris, as your
previous mails are there in the mailbox. :-)

Mine dates 2002 or so, but who cares?

> I use it in a commercial product.

So do I. The performance is unbelievable. :-)

> BTW why didn't you comment on my post!:

I don't read comp.programming.threads; if I see your
posts, it's because you send them to comp.arch.

> Do you have a patent application on it?

On what, exactly? Owner detection? Inverted FIFO?
Victim block caching? Pool autobalancing?

Best regards
Piotr Wyderski

Piotr Wyderski

Aug 29, 2007, 5:21:13 PM
Chris Thomasson wrote:

> I suppose that you use the same reference counting magic I created for my
> allocator? Its a fairly neat technique, IMO at least.

I don't have any reference counting in my allocator.

Best regards
Piotr Wyderski

PS. private mail, please... :-)

Chris Thomasson

Aug 29, 2007, 5:39:08 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:ePKdnTRuGpojS0jb...@comcast.com...

> "Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
> message news:fb4kb9$v8s$1...@news.dialog.net.pl...
>> Dmitriy Vyukov wrote:
>>
>>> How do you solve the problem when application have steady pattern
>>> which causes memory to be allocated in one thread and freed in
>>> another? For example producer-consumer pattern.
>>
>> I detect this situation and give the block back to the producer
>> using a lock-free "slow path". The producer's allocator does
>> not touch that blocks if not absolutely necessary and do it in
>> batches to spread the cost of cache coherence.
>
> That how my invention does it.
[...]

BTW, Piotr, did you think of using the allocator synchronization scheme to
allow the transformation of an inherently single-threaded malloc/free into
a full-blown scalable multi-threaded one:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb87f9cb2cb62d84

For instance, I used the invention to transform a single-threaded Windows
Heap API:

http://msdn2.microsoft.com/en-us/library/aa366711.aspx

into a scalable multi-threaded allocator. You use the HEAP_NO_SERIALIZE
flag for the HeapCreate function:

http://msdn2.microsoft.com/en-us/library/aa366599.aspx

in order to make it single-threaded...

The funny thing about this is... Well, the transformed Windows heap beats
the pants off the multi-threaded heap created without the HEAP_NO_SERIALIZE
flag set...

Piotr Wyderski

Aug 29, 2007, 5:38:52 PM
Chris Thomasson wrote:

> Why didn't you mention this earlier? Your 1-cas per-operation fifo can be
> improved.

No, the FIFO part (call it "execution engine") and the memory
manager are completely separated, and I don't want to use this
excessive knowledge about block internals to implement better
lock-free algorithms on top of it. In fact I can't, because there
is a SCRAM compilation switch to disable my allocator in
case of emergency: it had proven to be "too parallel"
in one of our old products -- there were some race conditions
which didn't show up only because of the internal synchronization
protocol of the original malloc() :-D

It has been fixed, of course, but the "SCRAM" is still there
(there are many others, in fact) and I want my lock-free
algorithms to work in emergency mode too, so there is no
room for improvements, at least of this kind.

Best regards
Piotr Wyderski

Chris Thomasson

Aug 29, 2007, 5:48:07 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fb4nrh$1t8$1...@news.dialog.net.pl...

> Chris Thomasson wrote:
>
>> That how my invention does it.
>
> I don't know your invention, so can't compare our approaches.
> However, this way is pretty natural, so I am not surprised that
> there are other users of this trick.
>
>> BTW, you should contact:
>
> Contact whom and for what purpose?

The people behind the StreamFlow allocator. I had to inform them that their
paper submission to

----
Proceedings of the 2006 ACM SIGPLAN International Symposium on Memory
Management, Ottawa, Canada, June 2006.
----

is based on prior art.


>> I have had lengthy discussion with the professor and his PHD
>> student. I can prove that I invented my algorithm in 2005.
>
> Good for you, but what's going on? I don't claim that you
> are not the author of your algorithm, I don't even know it.
> I will read your documentation when I find a moment of
> spare time, no problem. But I will not discuss about the
> differences in our design in public, these things are too
> valuable. You know my e-mail address, Chris, as there
> are your previous mails in the mailbox. :-)

This is already out in public. So, might as well discuss it here?

:^0


> Mine dates 2002 or so, but who cares?
>
>> I use it in a commercial product.
>
> So do I. The performance is unbelievable. :-)

Yes, the performance is extreme indeed!

:^)

>> BTW why didn't you comment on my post!:
>
> I don't read comp.programming.threads, if I see your
> posts, it's when you send them to comp.arch.
>
>> Do you have a patent application on it?
>
> On what, exactly? Owner detection? Inverted FIFO?
> Victim block caching? Pool autobalancing?

Here are some of the most relevant claims:


_____________________________
1. I claim use of no interlocked RMW instruction and/or memory
barrier in the fast-path of local allocations or deallocations.

2. I claim use of a single interlocked SWAP instruction and no memory
barrier for the slow-path of a local allocation.

3. I claim use of an interlocked Single CAS instruction and #StoreStore
memory barrier in the slow-path of a remote deallocation.

3a. I claim the interlocked Single CAS in claim 3 will operate on
memory locations that are the size of a system word.

4. I claim use of a single interlocked SWAP instruction and #LoadStore
memory barrier in thread termination.
_____________________________

Anything like that?

Chris Thomasson

Aug 29, 2007, 5:53:02 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fb4gbu$sjr$1...@news.dialog.net.pl...

> Chris Thomasson wrote:
>
>>> No, about a single CAS/operation concurrent FIFO.
>>
>> I would be interested in looking at the algorithm.
>
> I bet you would. :-]

[...]

I sure would!

:^)

Piotr Wyderski

Aug 29, 2007, 6:12:19 PM
Chris Thomasson wrote:

> The people behind the StreamFlow allocator. I had to inform them that
> their paper submission to
>
> ----
> Proceedings of the 2006 ACM SIGPLAN International Symposium on Memory
> Management, Ottawa, Canada, June 2006.
> ----
>
> is based on prior art.

I don't have to inform them; my prior art is much older than their paper.
But I'll check whether they use something similar to my approach.

> This is already out in public.

All the public knows is that there are two extremely fast memory
allocators out there and that both of them have something called
a "slow path". That is quite far enough for them. ;-)

> So, might as well discuss it here?

No, industrial espionage, you know. :-) Chris, we are talking
about a marvel that gives our respective companies serious
performance advantages. I can swear to you that I will not discuss
its guts in public. :-)

> Yes, the peformance is extreme indeed!

I have done a lot to make its fast path really FAST. :-)

> _____________________________
> 1. I claim use of no interlocked RMW instruction and/or memory
> barrier in the fast-path of local allocations or deallocations.

Agreed.

> 2. I claim use of a single interlocked SWAP instruction and no memory
> barrier for the slow-path of a local allocation.

I don't understand it, because it requires knowledge of your
algorithm, so please be patient. But I have no such thing as a slow
path of local allocation (well, there is one, but it is extremely rare), so
probably the answer is "no".

> 3. I claim use of a single interlocked CAS instruction and a #StoreStore
> memory barrier in the slow-path of a remote deallocation.

At most one CAS in this case.

> 3a. I claim the single interlocked CAS in claim 3 will operate on
> memory locations that are the size of a system word.

It is not a claim, but a property of the underlying system.

> 4. I claim use of a single interlocked SWAP instruction and a #LoadStore
> memory barrier in thread termination.

No, then my allocator follows a lock-based path (which implies
a full barrier), because there is no need to optimize this case.

Best regards
Piotr Wyderski

Piotr Wyderski

unread,
Aug 29, 2007, 6:42:26 PM8/29/07
to
Chris Thomasson wrote:

> Yes it does; please note that current x86 has a #LoadStore | #StoreStore
> barrier implied for stores, and #LoadStore | #LoadLoad implied for loads.

Yes, I know this, but my code must run on many other architectures
(SPARC, POWER, PA-RISC, there is a rumor about the Itanium
family), so I must specify the desired consistency semantics of CAS
manually (via a template parameter; it's C++ with a bit of assembly
language). The resulting code contains no barrier instruction on the
IA-32/EM64T family, indeed. But logically there is a barrier. For the
same reason I have consistency-augmented versions of load() and
store(), which do not even require atomic instructions on IA-32.
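
In today's terms the same idea can be sketched directly with C++11 atomics
(an illustration with hypothetical names, not Piotr's pre-C++11 code, which
used hand-written assembly per platform):

_____________________________
#include <atomic>

// The desired consistency semantics travel as a template parameter. On
// IA-32/EM64T the release/acquire orders compile to plain instructions
// with no extra fence; on SPARC RMO or POWER the compiler emits the
// required membar/lwsync instead.
template<std::memory_order Order, typename T>
bool cas(std::atomic<T>& target, T& expected, T desired) {
    // Relaxed on failure: ordering only matters when the CAS succeeds.
    return target.compare_exchange_strong(expected, desired, Order,
                                          std::memory_order_relaxed);
}

template<std::memory_order Order, typename T>
T load(const std::atomic<T>& target) {
    return target.load(Order);
}

template<std::memory_order Order, typename T>
void store(std::atomic<T>& target, T value) {
    target.store(value, Order);
}

// Usage: cas<std::memory_order_release>(head, expected, desired);
_____________________________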

> Notice how I say/comment that s/lfence instructions might be needed on
> future Intel CPUs? If you want this on SPARC in non-TSO mode, well,
> you need a #StoreStore.

Surely, been there... :-)

> If you're on Alpha, then you need a membar for the producer AND a membar
> for the consumer.

Fortunately Alpha is dead; its barrier on dependent
loads was a serious pain in the, erm, call it neck...

> How is my allocator invention, eventcount or my queue not low-overhead?

Give me a couple of days. :-)

Best regards
Piotr Wyderski

Dmitriy Vyukov

Aug 29, 2007, 6:53:55 PM
On Aug 30, 02:12, "Piotr Wyderski" wrote:

> No, industrial espionage, you know. :-) Chris, we are talking
> about a marvel that gives our respective companies serious
> performance advantages. I can swear you that I will not discuss
> about its guts in public. :-)

As I understand it, you don't want to reveal any details of the
algorithms and products you are working on. But can you please answer
2 harmless questions? ;) I am just interested in what fields such
techniques are relevant. And what kinds of companies are interested in
such specialists.
What software are you producing? I suppose it is server
systems :) Is it telecom or VoIP or what?
On what hardware does the software run? Is it commodity SMP/multicore
x86, or "big" servers (Sun or IBM)?

Dmitriy V'jukov

Piotr Wyderski

Aug 29, 2007, 7:08:06 PM
Dmitriy Vyukov wrote:

> I am just interested in what fields such techniques are relevant.

Whenever you want to process a tremendous amount of
data as fast as possible using a big multiprocessor machine.

> And what kinds of companies are interested in such specialists.

In my case it's the financial/banking sector.

> What software you are producing?

System integrators, business model execution frameworks,
dataflow processing, various virtual machines etc.

> I suppose it is server systems :)

Yes, you may call it so.

> On what hardware does the software run?

On almost everything that has ever been invented, frankly. :-)
Name a big enough "serious" processor family and it will
be a bull's-eye hit.

> Is it commodity SMP/multicore x86, or "big" servers (Sun or IBM)?

All of them (you know, it is much cheaper to build a cluster of commodity
machines and it works better than a mainframe), but the "big iron" segment
is our primary target.

Best regards
Piotr Wyderski

Dmitriy Vyukov

Aug 29, 2007, 7:19:17 PM
On Aug 30, 03:08, "Piotr Wyderski" <wyder...@mothers.against.spam-ii.uni.wroc.pl> wrote:
> Dmitriy Vyukov wrote:
> > I am just interested in what fields such techniques are relevant.
>
> Whenever you want to process a tremendous amount of
> data as fast as possible using a big multiprocessor machine.

Everybody wants to process data as fast as possible... but it seems that
nobody hires such specialists :)

Ok. Thank you.


Dmitriy V'jukov

Chris Thomasson

Aug 29, 2007, 10:33:10 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fb4qvo$40m$1...@news.dialog.net.pl...

> Chris Thomasson wrote:
>
>> The people behind the StreamFlow allocator. I had to inform them that
>> their paper submission to
>>
>> ----
>> Proceedings of the 2006 ACM SIGPLAN International Symposium on Memory
>> Management, Ottawa, Canada, June 2006.
>> ----
>>
>> is based on prior art.
>
> I don't have to inform them; my prior art is much older than their paper.
> But I'll check whether they use something similar to my approach.
>
>> This is already out in public.
>
> All the public knows is that there are two extremely fast memory
> allocators out there and that both of them have something called
> "slow path". It is far enough for them. ;-)
>
>> So, we might as well discuss it here?
>
> No, industrial espionage, you know. :-) Chris, we are talking
> about a marvel that gives our respective companies serious
> performance advantages. I can swear to you that I will not discuss
> its guts in public. :-)

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/f2c94118046142e8

:^0

Okay. Well, perhaps we can have a private conversation when we get some free
time. Should we both sign an NDA?

>> Yes, the performance is extreme indeed!
>
> I have done a lot to make its fast path to be really FAST. :-)
>
>> _____________________________
>> 1. I claim use of no interlocked RMW instruction and/or memory
>> barrier in the fast-path of local allocations or deallocations.
>
> Agreed.

Yup. This is totally equal to single-thread performance in this case.

>
>> 2. I claim use of a single interlocked SWAP instruction and no memory
>> barrier for the slow-path of a local allocation.
>
> I don't understand it, because it requires knowledge of your
> algorithm, so please be patient. But I have no such thing as a slow
> path of local allocation (well, there is one, but it is extremely rare), so
> probably the answer is "no".

The local-allocation slow-path is hit when the calling thread's allocator
exhausts its memory buckets.

>> 3. I claim use of a single interlocked CAS instruction and a #StoreStore
>> memory barrier in the slow-path of a remote deallocation.
>
> At most one CAS in this case.

Yup.


>> 3a. I claim the single interlocked CAS in claim 3 will operate on
>> memory locations that are the size of a system word.
>
> It is not a claim, but a property of the underlying system.

Agreed.


>> 4. I claim use of a single interlocked SWAP instruction and a #LoadStore
>> memory barrier in thread termination.
>
> No, then my allocator follows a lock-based path (which implies
> a full barrier), because there is no need to optimize this case.

I am interested in how you integrate the locking into the mix; perhaps I
will send you an e-mail. My allocator usually uses a
thread's actual stack as the memory pool, and I do use POSIX condition
variables in the commercial version of the code. The locking comes after
claim 4.


___PS___
Well, at least I am sure that we developed the algorithms independently
without any exchange of information beforehand. So, we have invented some
of the best-performing memory allocators "currently" out there! :^)

Chris Thomasson

Aug 29, 2007, 11:15:47 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fb4p10$2nu$1...@news.dialog.net.pl...

> Chris Thomasson wrote:
>
>> Why didn't you mention this earlier? Your 1-cas per-operation fifo can be
>> improved.
>
> No, the FIFO part (call it "execution engine") and the memory
> manager are completely separated and I don't want to use this
> excessive knowledge about block internals to implement better
> lock-free algorithms on top of it. In fact I can't, because there
> is a SCRAM compilation switch to disable my allocator in
> case of emergency; it had proven to be "too parallel"
> in one of our old products --

How could a _strictly conforming_ malloc/free, like my allocator and, I am
assuming, yours as well, be a bad thing? The underlying implementation of
malloc/free should be completely opaque and of no concern to any
standard-conforming application. Hmm, you mean it blew something apart
because prior race condition(s) were exposed by the increased parallelism?
I have dealt with this issue in some consulting work before; something
like this:

http://groups.google.com/group/comp.programming.threads/msg/9a5500d831dd2ec7

Good times! ;^) As for the queue implementation, I can't quite see how you
design the queue with 1-CAS per-op. You must be packing the head-and-tail
into a single word or something. Are you using a bounded queue?
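
For what it's worth, one way the word-packing guess could look is a bounded
FIFO that spends one successful CAS per operation by keeping the 32-bit
head and tail indices packed in a single 64-bit word, with per-slot
sequence numbers ordering the data handoff. A hypothetical C++11 sketch,
not Piotr's unpublished queue:

_____________________________
#include <atomic>
#include <cstdint>

class packed_fifo {
    static const uint32_t CAP = 1024;            // power of two
    struct slot {
        std::atomic<uint32_t> seq;               // ticket this slot expects
        void* item;
    };
    std::atomic<uint64_t> ht;                    // tail in high half, head low
    slot slots[CAP];

    static uint32_t head_of(uint64_t v) { return (uint32_t)v; }
    static uint32_t tail_of(uint64_t v) { return (uint32_t)(v >> 32); }
    static uint64_t pack(uint32_t h, uint32_t t) {
        return ((uint64_t)t << 32) | h;
    }

public:
    packed_fifo() : ht(0) {
        for (uint32_t i = 0; i != CAP; ++i)
            slots[i].seq.store(i, std::memory_order_relaxed);
    }

    bool push(void* item) {
        uint64_t v = ht.load(std::memory_order_relaxed);
        for (;;) {                               // one successful CAS per op
            if (tail_of(v) - head_of(v) == CAP)
                return false;                    // full
            if (ht.compare_exchange_weak(v, pack(head_of(v), tail_of(v) + 1),
                                         std::memory_order_relaxed))
                break;                           // ticket tail_of(v) is ours
        }
        uint32_t t = tail_of(v);
        slot& s = slots[t % CAP];
        while (s.seq.load(std::memory_order_acquire) != t)
            ;                                    // previous lap still draining
        s.item = item;
        s.seq.store(t + 1, std::memory_order_release);    // publish to consumer
        return true;
    }

    bool pop(void*& item) {
        uint64_t v = ht.load(std::memory_order_relaxed);
        for (;;) {
            if (head_of(v) == tail_of(v))
                return false;                    // empty
            if (ht.compare_exchange_weak(v, pack(head_of(v) + 1, tail_of(v)),
                                         std::memory_order_relaxed))
                break;                           // ticket head_of(v) is ours
        }
        uint32_t h = head_of(v);
        slot& s = slots[h % CAP];
        while (s.seq.load(std::memory_order_acquire) != h + 1)
            ;                                    // producer not finished yet
        item = s.item;
        s.seq.store(h + CAP, std::memory_order_release);  // recycle for next lap
        return true;
    }
};
_____________________________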

Chris Thomasson

Aug 29, 2007, 11:42:57 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fb4so6$58r$1...@news.dialog.net.pl...

> Chris Thomasson wrote:
>
>> Yes it does; please note that current x86 has a #LoadStore | #StoreStore
>> barrier implied for stores, and #LoadStore | #LoadLoad implied for loads.
>
> Yes, I know this, but my code must run on many other architectures

Yes. My experimental AppCore library, at this time, is for the x86 only. My
commercial library, vZOOM, currently runs on 32/64-bit SPARC, 32/64-bit
POWER, IA-32/64 and the ARM9. I also have a _highly experimental_ version
for the Cell BE. Programming lock-free stuff for the Cell is interesting to
say the least, stuff like LL/SC augmented with DMA... Doing the ARM9 version
was pretty interesting as well; the only atomic operation was SWP. I had
somebody interested in licensing the library for use in a tightly-packed
embedded environment running Quadros. I had to simulate a heap using the
stacks of Quadros "Tasks"... Luckily for me, the allocator runs on thread
stacks anyway.

> (SPARC, POWER, PA-RISC, there is a rumor about the Itanium
> family), so I must specify the desired consistency semantics of CAS
> manually (via a template parameter; it's C++ with a bit of assembly
> language). The resulting code contains no barrier instruction on the
> IA-32/EM64T family, indeed. But logically there is a barrier. For the
> same reason I have consistency-augmented versions of load() and
> store(), which do not even require atomic instructions on IA-32.

Yeah. A coherent abstraction over atomic operations can work out nicely.

>> Notice how I say/comment that s/lfence instructions might be needed on
>> future Intel CPUs? If you want this on SPARC in non-TSO mode, well,
>> you need a #StoreStore.
>
> Surely, been there... :-)

:^)

>> If you're on Alpha, then you need a membar for the producer AND a membar
>> for the consumer.
>
> Fortunately Alpha is dead; its barrier on dependent
> loads was a serious pain in the, erm, call it neck...

No shi%. IIRC, you had to use a full-blown MB instruction.

>> How is my allocator invention, eventcount or my queue not low-overhead?
>
> Give me a couple of days. :-)

Okay. ;^)

Piotr Wyderski

Aug 30, 2007, 5:56:25 AM
Chris Thomasson wrote:

> Well, perhaps we can have a private conversation when we get some free
> time.

Sounds great. :-)

> Should be both sign a NDA?

If you wish, but a gentlemen's agreement would be fine with me.

> Yup. This is totally equal to single-thread performance in this case.

And this kind of performance can also be boosted.

> The local-allocation slow-path is hit when the calling thread's allocator
> exhausts its memory buckets.

Well, there finally is a lock in the underlying malloc(),
when the buckets are all empty and you need to refill
them -- and you can do nothing about that.

> Well, at least I am sure that we developed the algorithms independently
> without any exchange of information beforehand.

Yes, I am totally convinced about the independence of our efforts
to design these memory allocators. They were so independent that I still
don't even know whether the allocators are similar or not. :o)

Best regards
Piotr Wyderski

Dmitriy Vyukov

Aug 30, 2007, 11:11:04 AM
On Aug 30, 1:56 pm, "Piotr Wyderski"

> > Yup. This is totally equal to single-thread performance in this case.
>
> And this kind of performance can also be boosted.

What do you mean? That the single-threaded allocation algorithm can be
optimized? Or that the synchronization algorithm can be optimized? But how
can a multithreaded allocator be faster than a single-threaded one?


Dmitriy V'jukov

Dmitriy Vyukov

Aug 30, 2007, 11:16:27 AM
On Aug 30, 12:48 am, "Piotr Wyderski"

> It limits the capacity of your inter-CPU communication channel to
> a mere ten million FIFO transactions per second (when you are
> lucky). I can get hundreds of millions when "cheating", that is, the
> processor serves its own requests, so it is not a true FIFO queue.
> My tasks are fine-grained, so it makes a big difference.

I suppose that you use something like the ABP work-stealing deque.
Something like this:
http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf
Am I right?

If so, I am interested in how far you have optimized it. It seems that
you use only highly optimized solutions ;)

How many CASes or expensive memory barriers do you have in the pop()/
push() and steal() functions?
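
For reference, the published design in that paper puts the costs roughly
here: push() needs no CAS at all (just a release fence), pop() pays one CAS
only when racing a thief for the last element, and steal() costs one CAS.
A C++11 sketch of the classic algorithm follows (an illustration of the
paper's deque with a fixed-size ring for brevity, not Piotr's code):

_____________________________
#include <atomic>
#include <cstdint>

class ws_deque {
    static const int64_t N = 1024;               // power of two
    std::atomic<int64_t> top{0}, bottom{0};
    std::atomic<void*> buf[N];

public:
    void push(void* x) {                         // owner only: no CAS
        int64_t b = bottom.load(std::memory_order_relaxed);
        buf[b % N].store(x, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);
        bottom.store(b + 1, std::memory_order_relaxed);
    }

    void* pop() {                                // owner only
        int64_t b = bottom.load(std::memory_order_relaxed) - 1;
        bottom.store(b, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        int64_t t = top.load(std::memory_order_relaxed);
        if (t > b) {                             // deque was empty
            bottom.store(b + 1, std::memory_order_relaxed);
            return nullptr;
        }
        void* x = buf[b % N].load(std::memory_order_relaxed);
        if (t == b) {                            // last element: race thieves
            if (!top.compare_exchange_strong(t, t + 1,
                    std::memory_order_seq_cst, std::memory_order_relaxed))
                x = nullptr;                     // a thief won
            bottom.store(b + 1, std::memory_order_relaxed);
        }
        return x;
    }

    void* steal() {                              // thieves: one CAS
        int64_t t = top.load(std::memory_order_acquire);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        int64_t b = bottom.load(std::memory_order_acquire);
        if (t >= b)
            return nullptr;                      // empty
        void* x = buf[t % N].load(std::memory_order_relaxed);
        if (!top.compare_exchange_strong(t, t + 1,
                std::memory_order_seq_cst, std::memory_order_relaxed))
            return nullptr;                      // lost the race; caller retries
        return x;
    }
};
_____________________________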


Dmitriy V'jukov

Piotr Wyderski

Aug 30, 2007, 11:14:50 AM
Dmitriy Vyukov wrote:

> What do you mean? That single-threaded allocation algorithm can be
> optimized?

Yes.

Best regards
Piotr Wyderski

Chris Thomasson

Aug 31, 2007, 4:26:53 AM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1188380057.2...@g4g2000hsf.googlegroups.com...
> On Aug 29, 10:04, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> Well, at least that paper mentions some of the caveats.
>
> If I have some heavy operation, and I don't want to put that operation
> into a transaction, because I don't want the operation to retry several
> times (the operation is heavy), then I must back off from TM to locking.
> So I revert to the old habits and the old problems!
> I think such situations (heavy operations) will stay in programs, so TM
> doesn't simplify things; TM will complicate things even more, because
> I will have to use both TM and locking, cope with both, prevent
> deadlocks, and watch that they don't mix badly together.
>
> Or I misunderstood something? Can someone elucidate?

I don't think you missed anything...

:^0
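
For what it's worth, the hybrid Dmitriy describes ends up looking roughly
like this in practice -- a sketch using Intel's RTM intrinsics (hardware
that shipped years after this thread); the key subtlety is that the
transaction must also read the fallback lock so that a lock holder aborts
it:

_____________________________
#include <immintrin.h>
#include <atomic>

std::atomic<bool> fallback_lock(false);

void heavy_update() {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Reading the lock inside the transaction subscribes us to it: if
        // some thread holds it, we abort instead of racing with it.
        if (fallback_lock.load(std::memory_order_relaxed))
            _xabort(0xff);
        /* ... short critical section, executed transactionally ... */
        _xend();
        return;
    }
    // Heavy or repeatedly-aborting operation: back off to plain locking --
    // exactly the reversion to "old habits" discussed above.
    while (fallback_lock.exchange(true, std::memory_order_acquire))
        ;                                        // spin
    /* ... critical section under the lock ... */
    fallback_lock.store(false, std::memory_order_release);
}
_____________________________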

Chris Thomasson

Aug 31, 2007, 4:34:29 AM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fb4p10$2nu$1...@news.dialog.net.pl...

> Chris Thomasson wrote:
>
>> Why didn't you mention this earlier? Your 1-cas per-operation fifo can be
>> improved.
>
> No, the FIFO part (call it "execution engine") and the memory
> manager are completely separated and I don't want to use this
> excessive knowledge about block internals to implement better
> lock-free algorithms on top of it.

[...]

You can keep the queue and the memory manager completely separated and
_still_ use, as Dmitriy calls it, the .5-CAS per-operation FIFO technique.

:^)

Chris Thomasson

Sep 1, 2007, 2:58:34 AM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:fb4m2e$iq$1...@news.dialog.net.pl...

> Chris Thomasson wrote:
>
>> You do not have to use a reserved memory address-span.
>> You can use the offset trick
>
> Or you can use indirect addressing based on an "index to address"
> translation LUT and many other things. But it is ugly. Not as ugly
> as SMR or LFRC, but ugly enough not to use it when it is not critical.

[...]
The index trick is not that ugly:

http://groups.google.com/group/comp.arch/msg/7737753cc958dafb

:^)
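
The general idea, sketched with hypothetical names (an illustration, not
the code behind that link): nodes live in a preallocated array, and instead
of CASing raw pointers you CAS a small index packed next to an ABA counter
in a single word, so a word-sized CAS suffices even with 64-bit pointers:

_____________________________
#include <atomic>
#include <cstdint>

struct node {
    uint32_t next;                         // an index, not a pointer
    /* ... payload ... */
};

static const uint32_t NIL = 0xFFFFFFFFu;
static node pool[1u << 16];                // all nodes preallocated

// High 32 bits: ABA counter. Low 32 bits: index of the top node.
static std::atomic<uint64_t> head(NIL);

void push(uint32_t idx) {
    uint64_t h = head.load(std::memory_order_relaxed);
    for (;;) {
        pool[idx].next = (uint32_t)h;      // link under the current top
        uint64_t nh = (((h >> 32) + 1) << 32) | idx;
        if (head.compare_exchange_weak(h, nh,
                std::memory_order_release, std::memory_order_relaxed))
            return;
    }
}

uint32_t pop() {
    uint64_t h = head.load(std::memory_order_acquire);
    for (;;) {
        uint32_t idx = (uint32_t)h;
        if (idx == NIL)
            return NIL;                    // empty
        uint64_t nh = (((h >> 32) + 1) << 32) | pool[idx].next;
        if (head.compare_exchange_weak(h, nh,
                std::memory_order_acquire, std::memory_order_acquire))
            return idx;                    // the counter defeats ABA
    }
}
_____________________________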

Dmitriy V'jukov

Sep 5, 2007, 8:20:02 AM
On Aug 28, 1:30 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> > Btw, AMD claims that their multicore chips are a little NUMA. The
> > distance between all cores and the memory bank is the same, but the
> > interface to the memory bank is 2-way, and the first core has priority
> > on the first way while the second core has priority on the second
> > way. As I understand it.
>
> Interesting. I think they are doing something that involves sticking
> multiple CPU's, GPU's and their various dedicated co-processors on a single
> chip.

See this article for details
THE AMD OPTERON NORTHBRIDGE ARCHITECTURE:
http://www.computer.org/portal/cms_docs_micro/micro/homepage/2007/m2010.pdf

Dmitriy V'jukov

Chris Thomasson

Sep 11, 2007, 11:53:23 PM
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:1188994802.6...@r29g2000hsg.googlegroups.com...

Coherent HyperTransport... Scalable DMA?

Joe Seigh

Sep 12, 2007, 6:21:49 AM


There's some stuff here
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf

They're presuming the OS scheduler is ccNUMA-aware and does the right thing
as far as the application is concerned, which I think means non-ccNUMA-aware
applications only get scheduled on one chip. Otherwise you'd get
non-ccNUMA-aware applications maxing out cache-coherency traffic on the
HyperTransport links.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Chris Thomasson

Sep 18, 2007, 7:41:40 AM
"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:15PFi.6507$Ot1.6386@trnddc07...

Oh my... I think you're dead on the money with your clever observations, Joe!

