http://groups.google.com/group/comp.arch/search?hl=en&group=comp.arch&q=%22chris+thomasson%22+transactional+memory
(here are some of my worries!)
I guess the chip vendors want their coherency systems to be at one with all
the live-locking in TM!
:^/
It seems that relaxed consistency and high-performance distributed
parallel computing are synonymous.
You can look at the design of the Google File System:
http://209.85.163.132/papers/gfs-sosp2003.pdf
They adopt relaxed consistency even at the file-system level to achieve
real distribution. GFS can run on clusters of thousands of machines
exactly because of relaxed consistency.
Dmitriy V'jukov
Because they don't have any better ideas about how to get all
that legacy software that isn't multithreaded to run in parallel
across the dozens of cores they will soon have. Will it work?
We'll see over the next decade. Azul is already promising they
can do this for Java with their Java big-iron (which has HTM).
David
> Because they don't have any better ideas about how to get all
> that legacy software that isn't multithreaded to run in parallel
> across the dozens of cores they will soon have.
Main question: what about new, multicore-aware software?
Will STM mode be optional? I.e. can I get plain-old-relaxed memory
model for my multicore-aware software?
Dmitriy V'jukov
That's true and unlikely to be overcome.
Einstein's theory of relativity says that events which are closer together
in time than in space (in units of c) cannot causally influence each other.
The conversion factor is the speed of light, or in electronics about 2/3 of it.
So the time window of a synchronisation point is at least as wide as the
distance between the two locations being synchronized, measured in units of c.
We are pretty close to this limit. At 3GHz the area which might be
synchronized within one clock cycle under ideal conditions is limited to
about 3.5cm. Since you have to include switching times too, it is less.
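For instance (a rough back-of-the-envelope, assuming signal propagation at
about 2/3 c and a round trip within the cycle):
d \approx \frac{(2/3)\,c}{2 f} = \frac{2 \times 10^{8}\ \mathrm{m/s}}{2 \cdot 3 \times 10^{9}\ \mathrm{Hz}} \approx 3.3\ \mathrm{cm}
which is in the same ballpark as the 3.5cm above.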
Moving the synchronization from the CPUs to a common unit (e.g. the
memory) reduces the synchronization delays.
Marcel
My guess (and I'm just a software guy who's read a little about
the TM proposals and the early implementations) is that TM will
always be optional in a technical sense. But if it works, or if
enough people are convinced that it will work, there will be
strong pressures on software projects to just rely on TM, and not
to worry about designing software that will scale to many cores
using other techniques.
David
> My guess (and I'm just a software guy who's read a little about
> the TM proposals and the early implementations) is that TM will
> always be optional in a technical sense.
So, Chris, there is nothing to worry about. You still will be able to
create more scalable and high-performance software ;)
> But if it works, or if
> enough people are convinced that it will work, there will be
> strong pressures on software projects to just rely on TM, and not
> to worry about designing software that will scale to many cores
> using other techniques.
Oh. It seems that you must hurry then, to keep that from happening ;)
Dmitriy V'jukov
I am not so sure that TM can reliably scale to many cores without simply
flooding everything with costly coherency traffic...
:^0
Yup. I want a massively multi-core cpu to be directly integrated onto a bank
of per-cpu "dedicated" underlying memory. Think of a physically large chip
sitting on top of GB(s) of pure "on-board integrated memory":
http://groups.google.com/group/comp.arch/msg/c9d99ae1251f2462
The physical distance between the chip and its dedicated memory bank could be
very, very small indeed!
[...]
This would be a "super computer on a chip", so to speak...
> > Yup. I want a massively multi-core cpu to be directly integrated onto a
> > bank of per-cpu "dedicated" underlying memory.
>
> [...]
>
> This would be a "super computer on a chip", so to speak...
We have this already. This is called NUMA ;)
But NUMA has a more complicated model, and it's harder to program for
it... So it's no longer plain-old-SMP... I think there is only a
handful of people who can program effectively for NUMA....
And yes, this seems to be the only way to get scalable systems...
Btw, AMD claims that their multicore chips are a little bit NUMA. The
distance between all the cores and the memory bank is the same, but the
interface to the memory bank is 2-way, and the first core has priority
on the first way while the second core has priority on the second way.
As I understand it.
Dmitriy V'jukov
Or, as core counts increase, legacy software parallelized
using TM might start to hit a wall of livelock, deadlock, or
limited parallelism due to dependencies between transactions.
But if TM is reasonably effective up to some number of cores,
then it will be used by the parts of the software world that have
large amounts of legacy code to support. The question of what
to do when single core performance hits its limits gets answered
by TM. Perhaps a few years later another question arises, what
to do when TM hits its limits? TM buys a few years of increased
performance with minimal code changes.
Optimists might hope that by the time TM hits its limits, other
parts of the software industry, not so dependent on legacy code
bases, will have pioneered other longer-term approaches to
developing software that exploits large numbers of cores.
David
> Perhaps a few years later another question arises, what
> to do when TM hits its limits? TM buys a few years of increased
> performance with minimal code changes.
It is unlikely that the hardware guys will implement some feature that
will serve for only a few years... To be implemented, a feature must serve
for at least a few decades!
Dmitriy V'jukov
And what about high-performance/scalable low-overhead custom
synchronization methods? :)
Dmitriy V'jukov
So be it.
You have to know what you're doing in order to even think about programming
for a transactional memory system. The way you design things can affect the
level of live-lock you're most likely going to experience during periods of
moderate-to-severe load on the sub-system...
> and your knowledge
> about lock-free programming with the usual CAS- or LL/SC-methods
> will be less acknowledged.
Well, I have wanted to clean up one of my in-house STM designs for a
while now. Perhaps I could put my skills with interlocked rmw and memory barrier
instructions into building a 'Minimalist STM Sub-System'. Do you think that
would be a good idea?
It seems that with multi-core the vendors are desperately in need of
programmers that can create high-performance concurrent software. So, I
guess HTM would be a sort of failsafe for a chip vendor. They don't seem to
realize that their existing instruction sets can be made to scale to very
many cores using low-overhead software-based synchronization algorithms...
> It seems that with multi-core the vendors are desperately in need of
> programmers that can create high-performance concurrent software. So, I
> guess HTM would be a sort of failsafe for a chip vendor. They don't seem to
> realize that their existing instruction sets can be made to scale to very
> many cores using low-overhead software-based synchronization algorithms...
Hmm. Multi-core chip vendors could include a commercial vZOOM license with
every box :)
It would be like a notebook with a preinstalled OS and other stuff. Did you
suggest this idea to them? :)
Dmitriy V'jukov
Well, I suggested this:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb87f9cb2cb62d84
to the following forum:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/1042/ShowForum.aspx
And got no reply. They deleted my first post, and I can't seem to find my
second post, which survived for about a month it seems:
http://groups.google.com/group/comp.programming.threads/msg/b0efcf84079aa780
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/45eb52c737cf5997
:^0
[...]
That would be a fairy tale, no?
:^(...
[...]
http://groups.google.com/group/comp.programming.threads/msg/fd1c7c4ce94778c5
DAMN!
I think that you may be able to find ways to use HTM to make your low-
overhead software-based synchronization algorithms perform better than
they would otherwise. Think of HTM as a hint to the processor that a
sequence of dependent reads and writes is about to take place, and that it
should attempt to minimize contention with other processors. At the
lowest level I only need a couple of queue primitives that are
synchronized across processors. I don't really care how complex the
implementation ends up being, if HTM can make them simpler, faster or
more reliable than a lockless method then it is a win.
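Just to make the "hint" concrete, the usage pattern I have in mind looks
roughly like the sketch below. It is written against the x86 RTM intrinsics
(_xbegin/_xend/_xabort from <immintrin.h>) purely for illustration -- treat it
as a sketch of the lock-elision pattern, not as any particular vendor's actual
HTM interface, and note that a real implementation would retry the transaction
a few times before falling back:
-----------
#include <immintrin.h>   // _xbegin/_xend/_xabort, compile with -mrtm
#include <atomic>

static std::atomic<int> fallback_lock{0};

template <class F>
void run_transactionally(F&& critical_section) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Abort if somebody currently holds the fallback lock, otherwise the
        // transaction could run concurrently with a locked critical section.
        if (fallback_lock.load(std::memory_order_relaxed) != 0)
            _xabort(0xff);
        critical_section();
        _xend();                      // commit: the whole block appears atomic
        return;
    }
    // Transaction aborted (conflict, capacity, interrupt...): take the lock.
    while (fallback_lock.exchange(1, std::memory_order_acquire) != 0)
        ;                             // spin
    critical_section();
    fallback_lock.store(0, std::memory_order_release);
}
-----------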
I can't quite see how vZOOM can benefit from HTM. The library tries to avoid
using interlocked rmw and/or memory barrier operations. A call into an HTM
instruction-set would, IMO, be an expensive call... BTW, there would have
to be two calls: a begin, and a subsequent commit. If you're dealing with a
scenario that has millions of read transactions and thousands of write
transactions on a sort-of persistent basis, then I want to avoid calling interlocked anything,
let alone HTM. Please note that this is not because of a bias toward not
using HTM. I want to avoid calling all expensive operations, which HTM
instruction-set will most likely be a part of. Well, when compared to the
basic instructions rmw instructions at lease. BTW, how are you going to
design a HTM that can have a low-overhead cache-coherency? You can coarsen
the grain for TM to increase atomic-block performance, but then you increase
the overall live-lock!
That is two calls in a loop that is not guaranteed to be committed on the
first try.... Please keep in mind that the loop comprises a full
atomic-block. This block may be iterating over large linked data-structures.
The depth of the linked collections is directly related to the level of
live-lock you will experience.
vZOOM can allow multiple reader threads to iterate throughout multiple
unrelated linked-data-structures _concurrently with multiple writer threads
making various modifications_, without _any_ livelock...
> Please note that this is not because of a bias toward not
> using HTM. I want to avoid calling all expensive operations, which HTM
> instruction-set will most likely be a part of.
> Well, when compared to the
> basic instructions rmw instructions at lease.
^^^^^^^^^^^^^
[...]
That sentence needs to read as:
'Well, when compared to basic interlocked rmw instruction at least.'
> It seems that with multi-core the vendors are desperately in need of
> programmers that can create high-performance concurrent software. So, I
> guess HTM would be a sort of failsafe for a chip vendor. They don't seem
> to realize that their existing instruction sets can be made to scale to
> very many cores using low-overhead software-based synchronization
> algorithms...
What "existing instruction sets" are you writing about?
That of Sparc, which doesn't include wide CAS in
64-bit mode? Or perhaps you mean IA-32 with its
lock-based instructions which are so slow that they are barely
usable? Been there, done that -- lock-free algorithms
are a great alternative for many lock-based solutions,
but even then their performance is simply a disaster.
Could the chip vendors give us an extremely fast
point-to-point communication framework which can
send a message from processor A directly to processor
B instead of HTM, STM or similar insanities? There
is such a bus, based on the local APIC units and used
for IPIs, but it's way too slow and too primitive. It
is perfectly doable, for instance the Virtex 4 FPGA
chip has many multi-GHz differential lanes to provide
amazingly fast burst transfers. Why can't Core2
have 4 more pins (2 for diff in, 2 for diff out) and
pump its IPC data into the next CPU's receive buffer
at 2GHz and simultaneously receive something from
the previous CPU? A simple token ring topology
would be enough for small systems and a P2P
router chip can be added for the larger ones.
Best regards
Piotr Wyderski
> Been there, done that -- lock-free algorithms
> are a great alternative for many lock-based solutions,
> but even then their performance is simply a disaster.
You aren't by any chance talking about those bus-saturating 5-CAS-per-
operation algorithms by Michael, Valois etc.? :)))
Dmitriy V'jukov
> And what about high-performance/scalable low-overhead custom
> synchronization methods? :)
"Low overhead" and "synchronization" in the same sentence?
There is no such beast. You may call it "lower overhead" and
you will be perfectly correct, but it is far from being "low".
Best regards
Piotr Wyderski
> You aren't by any chance talking about those bus-saturating 5-CAS-per-
> operation algorithms by Michael, Valois etc.? :)))
No, about a single CAS/operation concurrent FIFO.
Best regards
Piotr Wyderski
What about the lock-free reader pattern?
No interlocked rmw or expensive memory barriers at all on the reader side.
Dmitriy V'jukov
What type of FIFO queue are you talking about?
A multiple-producer/multiple-consumer queue must have more CASes per
operation.
What type of FIFO queue can have a single CAS per operation?
Dmitriy V'jukov
> What about the lock-free reader pattern?
Hardly ever applicable.
Best regards
Piotr Wyderski
Why exactly? Please, explain in detail.
Dmitriy V'jukov
Current SPARC and IA-32/64 instruction sets work fine. You don't really need
DWCAS. As for the lock prefix on x86, well, you can design around using
them. Have you seen my memory allocator algorithm that does this? I have
posted it to this group a couple of times.
> Been there, done that -- lock-free algorithms
> are a great alternative for many lock-based solutions,
> but even then their performance is simply a disaster.
You mean performance of poorly designed lock-free algorithms is a disaster,
right?
> Could the chip vendors give us an extremely fast
> point-to-point communication framework which can
> send a message from processor A directly to processor
> B instead of HTM, STM or similar insanities?
Sure, but message-passing algorithms are really not that much help for
certain synchronization needs. For instance, when your reader threads need
to iterate through large linked collections, message-passing is not going to
help much...
[...]
I disagree. Look here for a simple example:
http://groups.google.com/group/comp.arch/browse_frm/thread/24c40d42a04ee855
The algorithm I invented can implement a multi-threaded allocator that can
scale to very many cores and has virtually zero overhead. How can you say
that the overhead is far from being low? Here is a lock-free FIFO queue that
uses no interlocked rmw or store/load memory barriers:
http://groups.google.com/group/comp.lang.c++.moderated/msg/9bdc5f50be08d7d6
How can you say that such algorithms are far from being low-overhead? BTW,
that queue can be multiplexed to provide high-speed message-passing between
multiple CPUs. I have more examples if you want.
You think most applications don't have any "read-mostly" access patterns into
shared data-structures? Did you know that lock-free reader patterns can
usually be made to replace read-write locks?
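To illustrate the general shape of such a pattern (a toy sketch only, not
vZOOM or RCU proper; I am assuming a single writer and omitting reclamation of
the old version, which is the genuinely hard part):
-----------
#include <atomic>

struct Snapshot { int data[64]; };

std::atomic<Snapshot*> g_current{new Snapshot{}};

int reader(int i) {
    // plain acquire load: no CAS, no lock, no #StoreLoad fence on the reader side
    Snapshot* s = g_current.load(std::memory_order_acquire);
    return s->data[i];
}

void writer(int i, int value) {       // single writer assumed
    Snapshot* old_version = g_current.load(std::memory_order_relaxed);
    Snapshot* fresh = new Snapshot(*old_version);        // copy-on-write
    fresh->data[i] = value;
    g_current.store(fresh, std::memory_order_release);   // publish the new version
    // old_version may only be freed once no reader can still be using it
    // (grace periods / SMR / proxy collection) -- deliberately omitted here.
}
-----------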
I would be interested in looking at the algorithm. However, I can multiplex
single-producer/consumer queues in a way that can beat a CAS-based FIFO...
I guess you could use the reversed LIFO stack trick:
http://groups.google.ca/group/comp.programming.threads/msg/a359bfb41c68b98b
However, that is a multiple-producer/single-consumer queue. Although, the
overhead of the reversal is proportional to the depth of the queue...
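For anyone who hasn't seen the trick, it is roughly the following (a minimal
sketch of the general idea only; the node layout and names are made up here):
-----------
#include <atomic>

struct Node {
    Node* next;
    void* payload;
};

class MpscReversedStack {
    std::atomic<Node*> head{nullptr};
public:
    void push(Node* n) {                      // any producer: one CAS per push
        Node* old = head.load(std::memory_order_relaxed);
        do {
            n->next = old;
        } while (!head.compare_exchange_weak(old, n,
                     std::memory_order_release,
                     std::memory_order_relaxed));
    }

    // The single consumer grabs everything with one atomic exchange (the XCHG)
    // and reverses the chain to restore FIFO order. All-or-nothing: you cannot
    // take just one node.
    Node* pop_all_fifo() {
        Node* lifo = head.exchange(nullptr, std::memory_order_acquire);
        Node* fifo = nullptr;
        while (lifo) {
            Node* next = lifo->next;
            lifo->next = fifo;
            fifo = lifo;
            lifo = next;
        }
        return fifo;                          // nullptr if the stack was empty
    }
};
-----------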
Tell that to McKenney over on the Linux Kernel Mailing List!
;^)
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:3OGdnXm6gKyzhk7b...@comcast.com...
[...]
> I guess you could use the reversed LIFO stack trick:
>
> http://groups.google.ca/group/comp.programming.threads/msg/a359bfb41c68b98b
>
> However, that is a multiple-producer/single-consumer queue. Although, the
> overhead of the reversal is proportional to the depth of the queue...
The reversed LIFO stack trick _is_ a multi-producer/consumer. The _caveat_
is that a consumer cannot remove a single node from the queue, it's all or
nothing.
> > I guess you could use the reversed LIFO stack trick:
It's not one CAS/operation. It's more accurate to say it's 0.5 CAS/
operation: the producer spends one CAS on each push, and the consumer's
single XCHG is amortized over the whole batch it grabs.
Dmitriy V'jukov
Well, I want to see the day when there are hundreds-of-cores per-chip...
:^)
> But NUMA has a more complicated model, and it's harder to program for
> it... So it's no longer plain-old-SMP... I think there is only a
> handful of people who can program effectively for NUMA....
> And yes, this seems to be the only way to get scalable systems...
Yeah. It's basically all about binding the actions of a cpu to its local
memory bank. Also, you kind of have to use algorithms, such as
RCU/RCU-SMR/vZOOM, which can run under relaxed coherency schemes... Do we
really need ccNUMA? Na...
> Btw, AMD claims that their multicore chips are a little bit NUMA. The
> distance between all the cores and the memory bank is the same, but the
> interface to the memory bank is 2-way, and the first core has priority
> on the first way while the second core has priority on the second way.
> As I understand it.
Interesting. I think they are doing something that involves sticking
multiple CPU's, GPU's and their various dedicated co-processors on a single
chip.
:^)
> Could the chip vendors give us an extremely fast
> point-to-point communication framework which can
> send a message from processor A directly to processor
> B
It depends, I seem to remember that the transputer used a fast serial
interface to connect processors. It was fast for the time but limited
the number of processors that could directly talk to each other.
Ken Young
DMA:
http://groups.google.com/group/comp.arch/msg/198a048b98fabe45
You can do message-passing with those instructions.
http://blogs.sun.com/dave/resource/cgo2007-TL1-STM-camera.pdf
Zombie Transactions... OUCH!
:^0
Well, at least that paper mentions some of the caveats.
They also seem to finally realize that lock-based TM is far superior to
lock-free TM.
> Well, at least that paper mentions some of the caveats.
If I have some heavy operation, I don't want to put that operation
into a transaction, because I don't want the operation to retry several
times (the operation is heavy). Then I must back off from TM to locking,
so I revert to the old habits and the old problems!
I think such situations (heavy operations) will stay in programs, so TM
doesn't simplify things; TM will complicate things even more, because
I will have to use both TM and locking, cope with both, prevent
deadlocks, and watch that they don't mix badly together.
Or have I misunderstood something? Can someone elucidate?
Dmitriy V'jukov
> Current SPARC and IA-32/64 instruction sets work fine.
> You don't really need DWCAS.
What do you mean by DWCAS? cmpxchg8b in 32-bit
mode or a true CAS on two unrelated memory locations?
If you mean the first option, then I really do know the
tricks, but they are pretty ugly, e.g. a separate 4GiB
memory area reserved solely for the lock-free algorithms.
> As for the lock prefix on x86, well, you can design around using them.
It doesn't matter whether you use the lock prefix or simulate
it somehow (for example, using the load-to-store forwarding
bypass trick based on mixed-size memory operands). The
problem is with the cache coherence protocol, which has
tremendous overheads.
> Have you seen my memory allocator algorithm that does this?
No, I have not, but I have also designed a blindingly fast parallel
allocator for the industry. The key to high performance is to
do NUMA-style processing on a UMA machine, that is,
separate the processing units as much as possible.
> You mean performance of poorly designed lock-free algorithms is a
> disaster, right?
No, I mean that anything that requires strong coherence
between different processing units is a performance disaster
whatever technique you choose. It can be a lock-free
algorithm based on CAS, it can be a mutex-based approach
-- it simply will be slow. The lock-free way will scale up
much better, indeed, but even then the performance will
be poor compared to a piece of code executed on a
uniprocessor machine. "much better" doesn't mean "good".
> Sure, but message-passing algorithms are really not that much help for
> certain synchronization needs.
Surely, but they can fulfill many other needs perfectly.
Best regards
Piotr Wyderski
You do not have to use a reserved memory address-span. You can use the offset
trick or simply not rely on algorithms that depend on DWCAS.
>> As for the lock prefix on x86, well, you can design around using them.
>
> It doesn't matter whether you use the lock prefix or simulate
> it somehow (for example, using the load-to-store forwarding
> bypass trick based on mixed-size memory operands). The
> problem is with the cache cocherence protocol, which has
> tremendous overheads.
I do not simulate the lock prefix in any way shape or form. However, I agree
with you that cache-coherency overhead is a real issue. I personally want
very relaxed models: I do not see any "real need" for super-strong coherency
schemes.
>> Have you seen my memory allocator algorithm that does this?
>
> No, I have not, but I have also designed a blindingly fast parallel
> allocator for the industry. The key to high performance is to
> do NUMA-style processing on a UMA machine, that is,
> separate the processing units as much as possible.
Here is my invention:
----------
http://groups.google.com/group/comp.arch/browse_frm/thread/24c40d42a04ee855
The algorithm I invented can implement a multi-threaded allocator that can
scale to very many cores and has virtually zero overhead. How can you say
that the overhead is far from being low? Here is a lock-free FIFO queue that
uses no interlocked rmw or store/load memory barriers:
http://groups.google.com/group/comp.lang.c++.moderated/msg/9bdc5f50be08d7d6
How can you say that such algorithms are far from being low-overhead? BTW,
that queue can be multiplexed to provide high-speed message-passing between
multiple CPUs. I have more examples if you want.
----------
>> You mean performance of poorly designed lock-free algorithms is a
>> disaster, right?
>
> No, I mean that anything that requires strong coherence
> between different processing units is a performance disaster
> whatever technique you choose.
vZOOM, RCU, RCU+SMR, and other such algorithms do not depend on strong
cache coherency. In fact, they don't put any requirements on cache-coherency. They can
rely on weak models indeed...
> It can be a lock-free
> algorithm based on CAS, it can be a mutex-based approach
> -- it simply will be slow. The lock-free way will scale up
> much better, indeed, but even then the performance will
> be poor compared to a piece of code executed on a
> uniprocessor machine. "much better" doesn't mean "good".
>
>> Sure, but message-passing algorithms are really not that much help for
>> certain synchronization needs.
>
> Surely, but they can fullfil many other needs perfectly.
Yup. It depends on what your needs are exactly...
;^)
> No, I have not, but I have also designed a blindingly fast parallel
> allocator for the industry. The key to high performance is to
> do NUMA-style processing on a UMA machine, that is,
> separate the processing units as much as possible.
How do you solve the problem when an application has a steady pattern
which causes memory to be allocated in one thread and freed in
another? For example, the producer-consumer pattern.
Dmitriy V'jukov
>> No, about a single CAS/operation concurrent FIFO.
>
> I would be interested in looking at the algorithm.
I bet you would. :-]
> However, I can multiplex single-producer/consumer
> queues in a way that can beat a CAS-based FIFO...
But you cannot implement an extremely fine-grained
multi-producer/multi-consumer FIFO which is free
of the coherence overhead. With 8+ CPUs spinning
on such a queue your bus simply goes crazy. One
must use different approaches + a lot of cheating to
achieve similar results.
Best regards
Piotr Wyderski
Well, here is the queue:
http://groups.google.com/group/comp.lang.c++.moderated/msg/9bdc5f50be08d7d6
I don't spin on the queue. I use an eventcount algorithm I invented:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/aa8c62ad06dbb380
This gives a high-performance condvar-like waiting ability to many different types of
lock-free algorithms.
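The general idea is roughly the following (a much-simplified sketch, not the
actual AppCore implementation; a production eventcount also avoids touching the
mutex in signal() when nobody is waiting):
-----------
#include <atomic>
#include <condition_variable>
#include <mutex>

class EventCount {
    std::atomic<unsigned> count{0};
    std::mutex mtx;
    std::condition_variable cv;
public:
    unsigned prepare_wait() {                 // take a "ticket"
        return count.load(std::memory_order_acquire);
    }
    void wait(unsigned ticket) {              // block unless signaled since then
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [&] {
            return count.load(std::memory_order_relaxed) != ticket;
        });
    }
    void signal() {                           // producer calls this after a push
        {
            std::lock_guard<std::mutex> lk(mtx);
            count.fetch_add(1, std::memory_order_release);
        }
        cv.notify_all();
    }
};

// consumer side, roughly:
//   for (;;) {
//       if (Node* n = q.try_pop()) return n;     // lock-free fast path
//       unsigned t = ec.prepare_wait();
//       if (Node* n = q.try_pop()) return n;     // re-check before sleeping
//       ec.wait(t);
//   }
-----------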
> How do you solve the problem when an application has a steady pattern
> which causes memory to be allocated in one thread and freed in
> another? For example, the producer-consumer pattern.
I detect this situation and give the block back to the producer
using a lock-free "slow path". The producer's allocator does
not touch those blocks unless absolutely necessary, and does it in
batches to spread the cost of cache coherence.
Best regards
Piotr Wyderski
It looks very similar to Chris Thomasson's invention (judging by this brief
description).
I suppose that you use a "reversed" multiple-producer/single-
consumer fifo queue based on a lifo-stack to pass blocks back to the "producer"
thread, and the "producer" grabs the whole queue with one XCHG ;)
That's how my invention does it. BTW, you should contact:
http://groups.google.com/group/comp.programming.threads/msg/0156c6caa9bd7f69
I have had lengthy discussion with the professor and his PHD student. I can
prove that I invented my algorithm in 2005. I use it in a commercial
product. BTW why didn't you comment on my post!:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/7eb03aabbe31c855
Do you have a patent application on it?
God damn typo!
I can prove I invented the allocator, in-house, in 2004. I posted the algo
in comp.programming.threads in 2005.
> You do not have to use a reserved memory address-span.
> You can use the offset trick
Or you can use indirect addressing based on an "index to address"
translation LUT and many other things. But it is ugly. Not as ugly
as SMR or LFRC, but ugly enough not to use it when it is not critical.
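For the record, the shape of the trick is something like this (an illustrative
sketch only, with 16-bit indices and tags picked arbitrarily; it is nobody's
production code):
-----------
#include <atomic>
#include <cstdint>

struct Node { uint16_t next; /* ...payload... */ };

static const uint16_t NIL = 0xFFFF;
static Node g_nodes[NIL];                  // the index -> address table

static inline uint32_t pack(uint16_t idx, uint16_t tag) {
    return (uint32_t)idx | ((uint32_t)tag << 16);
}
static inline uint16_t idx_of(uint32_t v) { return (uint16_t)v; }
static inline uint16_t tag_of(uint32_t v) { return (uint16_t)(v >> 16); }

static std::atomic<uint32_t> g_top{pack(NIL, 0)};  // {index, ABA tag} in one word

void lifo_push(uint16_t idx) {
    uint32_t old = g_top.load(std::memory_order_relaxed);
    do {
        g_nodes[idx].next = idx_of(old);
    } while (!g_top.compare_exchange_weak(old, pack(idx, tag_of(old)),
                 std::memory_order_release, std::memory_order_relaxed));
}

uint16_t lifo_pop() {
    uint32_t old = g_top.load(std::memory_order_acquire);
    for (;;) {
        uint16_t idx = idx_of(old);
        if (idx == NIL)
            return NIL;
        // Bump the tag so a pop/push/pop of the same index cannot be mistaken
        // for "nothing changed" (the ABA problem). Reading g_nodes[idx].next
        // is always safe because the node array itself is never freed.
        uint32_t next_top = pack(g_nodes[idx].next, (uint16_t)(tag_of(old) + 1));
        if (g_top.compare_exchange_weak(old, next_top,
                std::memory_order_acquire, std::memory_order_acquire))
            return idx;
    }
}
-----------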
> However, I agree with you that cache-coherency overhead is a real issue.
It limits the capacity of your inter-CPU communication channel to
a mere ten million FIFO transactions per second (when you are
lucky). I can get hundreds of millions when "cheating", that is, the
processor serves its own requests, so it is not a true FIFO queue.
My tasks are fine-grained, so it makes a big difference.
> I personally want very relaxed models: I do not see any
> "real need" for super-strong coherency schemes.
It doesn't help in many situations. The relaxed consistency models
are useful when you pass simple objects, but often a pointer to
a very complex object is being passed through the queue. Then
you must use a full-blown pair of acquire and release barriers no
matter what -- the view of the content of the received object
_must_ be consistent.
> How can you say that the overhead is far from being low?
Perhaps that is because we use different bases for comparison.
In my case it is the performance of a uniprocessor machine,
yours seems to be the lock-based approach.
> Here is a lock-free FIFO queue that
> uses no interlocked rmw or store/load memory barriers:
Does it guarantee a consistent view of an object passed by pointer?
It does not, so it doesn't matter whether or not the FIFO itself
requires a barrier. The contents require it.
> How can you say that such algorithms are far from being low-overhead?
It's quite easy, I can teach you if you wish. :-)
Best regards
Piotr Wyderski
Yes it does; please note that current x86 has a #LoadStore | #StoreStore
barrier implied for stores, and #LoadStore | #LoadLoad implied for loads.
Please look at the code:
-----------
align 16
ac_i686_queue_spsc_push PROC
    mov eax, [esp + 4]       ; eax = queue
    mov ecx, [esp + 8]       ; ecx = node being pushed
    mov edx, [eax + 4]       ; edx = queue->tail
    ; sfence may be needed on future x86
    mov [edx], ecx           ; tail->next = node
    mov [eax + 4], ecx       ; queue->tail = node
    ret
ac_i686_queue_spsc_push ENDP

align 16
ac_i686_queue_spsc_pop PROC
    push ebx
    mov ecx, [esp + 8]       ; ecx = queue
    mov eax, [ecx]           ; eax = queue->head
    cmp eax, [ecx + 4]       ; head == tail -> queue is empty
    je ac_i686_queue_spsc_pop_failed
    mov edx, [eax]           ; edx = head->next
    ; lfence may be needed on future x86
    mov ebx, [edx + 12]      ; grab the next node's payload...
    mov [ecx], edx           ; queue->head = head->next
    mov [eax + 12], ebx      ; ...and move it into the node we return
    pop ebx
    ret                      ; eax = dequeued node
ac_i686_queue_spsc_pop_failed:
    xor eax, eax             ; return 0: nothing to pop
    pop ebx
    ret
ac_i686_queue_spsc_pop ENDP
-----------
Notice how I say/comment that the s/lfence instructions might be needed on future
Intel CPUs? If you want this on SPARC in non-TSO mode, well, you need a
#StoreStore. You don't need full acquire/release. #StoreLoad is not
required here. If you're on Alpha, then you need a membar for the producer AND a
membar for the consumer.
>> How can you say that such algorithms are far from being low-overhead?
> It's quite easy, I can teach you if you wish. :-)
How is my allocator invention, eventcount or my queue not low-overhead?
> I suppose that you use a "reversed" multiple-producer/single-
> consumer fifo queue based on a lifo-stack to pass blocks back to the "producer"
> thread, and the "producer" grabs the whole queue with one XCHG ;)
Precisely, I use this ancient trick among others. :-)
Best regards
Piotr Wyderski
Why didn't you mention this earlier? Your 1-cas per-operation fifo can be
improved.
I suppose that you use the same reference counting magic I created for my
allocator? It's a fairly neat technique, IMO at least.
> That how my invention does it.
I don't know your invention, so can't compare our approaches.
However, this way is pretty natural, so I am not surprised that
there are other users of this trick.
> BTW, you should contact:
Contact whom and for what purpose?
> I have had lengthy discussion with the professor and his PHD
> student. I can prove that I invented my algorithm in 2005.
Good for you, but what's going on? I don't claim that you
are not the author of your algorithm, I don't even know it.
I will read your documentation when I find a moment of
spare time, no problem. But I will not discuss the
differences in our designs in public, these things are too
valuable. You know my e-mail address, Chris, as there
are your previous mails in the mailbox. :-)
Mine dates 2002 or so, but who cares?
> I use it in a commercial product.
So do I. The performance is unbelievable. :-)
> BTW why didn't you comment on my post!:
I don't read comp.programming.threads, if I see your
posts, it's when you send them to comp.arch.
> Do you have a patent application on it?
On what, exactly? Owner detection? Inverted FIFO?
Victim block caching? Pool autobalancing?
Best regards
Piotr Wyderski
> I suppose that you use the same reference counting magic I created for my
> allocator? It's a fairly neat technique, IMO at least.
I don't have any reference counting in my allocator.
Best regards
Piotr Wyderski
PS. private mail, please... :-)
BTW, Piotr, did you think of using the allocator synchronization scheme to
allow the transformation of an inherently single-threaded malloc/free into a
full-blown scalable multi-threaded one:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb87f9cb2cb62d84
For instance, I used the invention to transform the single-threaded Windows
Heap API:
http://msdn2.microsoft.com/en-us/library/aa366711.aspx
into a scalable multi-threaded allocator. You use the HEAP_NO_SERIALIZE
flag for the HeapCreate function:
http://msdn2.microsoft.com/en-us/library/aa366599.aspx
in order to make it single-threaded...
The funny thing about this is... Well, the transformed Windows heap beats
the pants off the multi-threaded heap created without the HEAP_NO_SERIALIZE
flag set...
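The shape of the wrapper is roughly the following. This is only a bare sketch
of the general idea, not the actual vZOOM code; how a freed block finds its
owning heap, and the assumption that every block is at least pointer-sized, are
details I am glossing over here:
-----------
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <atomic>

struct RemoteFree { RemoteFree* next; };    // blocks assumed >= pointer-sized

struct PerThreadHeap {
    HANDLE heap;                               // non-serialized heap, owner only
    std::atomic<RemoteFree*> remote{nullptr};  // frees arriving from other threads

    PerThreadHeap() : heap(HeapCreate(HEAP_NO_SERIALIZE, 0, 0)) {}

    void* alloc(SIZE_T n) {                    // owner thread only
        drain_remote();                        // fold in foreign frees, in a batch
        return HeapAlloc(heap, 0, n);
    }

    void local_free(void* p) {                 // owner thread only
        HeapFree(heap, 0, p);
    }

    void remote_free(void* p) {                // any other thread: one CAS, no lock
        RemoteFree* node = static_cast<RemoteFree*>(p);
        RemoteFree* old = remote.load(std::memory_order_relaxed);
        do {
            node->next = old;
        } while (!remote.compare_exchange_weak(old, node,
                     std::memory_order_release, std::memory_order_relaxed));
    }

private:
    void drain_remote() {                      // owner thread: one XCHG per batch
        RemoteFree* list = remote.exchange(nullptr, std::memory_order_acquire);
        while (list) {
            RemoteFree* next = list->next;
            HeapFree(heap, 0, list);
            list = next;
        }
    }
};
-----------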
> Why didn't you mention this earlier? Your 1-cas per-operation fifo can be
> improved.
No, the FIFO part (call it the "execution engine") and the memory
manager are completely separated, and I don't want to use this
extra knowledge about block internals to implement better
lock-free algorithms on top of it. In fact I can't, because there
is a SCRAM compilation switch to disable my allocator in
case of emergency, because it has proven to be "too parallel"
in one of our old products -- there were some race conditions
which hadn't shown up before only because of the internal synchronization
protocol of the original malloc() :-D
It has been fixed, of course, but the "SCRAM" is still there
(there are many others, in fact) and I want my lock-free
algorithms to work in emergency mode too, so there is no
room for improvements, at least of this kind.
Best regards
Piotr Wyderski
The people behind the StreamFlow allocator. I had to inform them that their
paper submission to
----
Proceedings of the 2006 ACM SIGPLAN International Symposium on Memory
Management, Ottawa, Canada, June 2006.
----
is based on prior-art.
>> I have had lengthy discussion with the professor and his PHD
>> student. I can prove that I invented my algorithm in 2005.
>
> Good for you, but what's going on? I don't claim that you
> are not the author of your algorithm, I don't even know it.
> I will read your documentation when I find a moment of
> spare time, no problem. But I will not discuss the
> differences in our designs in public, these things are too
> valuable. You know my e-mail address, Chris, as there
> are your previous mails in the mailbox. :-)
This is already out in public. So, might as well discuss it here?
:^0
> Mine dates 2002 or so, but who cares?
>
>> I use it in a commercial product.
>
> So do I. The performance is unbelievable. :-)
Yes, the performance is extreme indeed!
:^)
>> BTW why didn't you comment on my post!:
>
> I don't read comp.programming.threads, if I see your
> posts, it's when you send them to comp.arch.
>
>> Do you have a patent application on it?
>
> On what, exactly? Owner detection? Inverted FIFO?
> Victim block caching? Pool autobalancing?
Here are some of the most relevant claims:
_____________________________
1. I claim use of no interlocked rmw instruction and/or memory
barrier in the fast-path of local allocations or deallocations.
2. I claim use of a single interlocked SWAP instruction and no memory
barrier for the slow-path of a local allocation.
3. I claim use of a single interlocked CAS instruction and a #StoreStore
memory barrier in the slow-path of a remote deallocation.
3a. I claim the interlocked Single CAS in claim 3 will operate on
memory locations that are the size of a system word.
4. I claim use of a single interlocked SWAP instruction and a #LoadStore
memory barrier at thread termination.
_____________________________
Anything like that?
[...]
I sure would!
:^)
> The people behind the StreamFlow allocator. I had to inform them that
> their paper submission to
>
> ----
> Proceedings of the 2006 ACM SIGPLAN International Symposium on Memory
> Management, Ottawa, Canada, June 2006.
> ----
>
> is based on prior-art.
I don't have to inform them, my prior art is much older than their paper.
But I'll check whether they use something similar to my approach.
> This is already out in public.
All the public knows is that there are two extremely fast memory
allocators out there and that both of them have something called
"slow path". It is far enough for them. ;-)
> So, might as well discuss it here?
No, industrial espionage, you know. :-) Chris, we are talking
about a marvel that gives our respective companies serious
performance advantages. I can swear to you that I will not discuss
its guts in public. :-)
> Yes, the performance is extreme indeed!
I have done a lot to make its fast path to be really FAST. :-)
> _____________________________
> 1. I claim use of no interlocked rmw instruction and/or memory
> barrier in the fast-path of local allocations or deallocations.
Agreed.
> 2. I claim use of a single interlocked SWAP instruction and no memory
> barrier for the slow-path of a local allocation.
I don't understand it, because it requires the knowledge of your
algorithm, so please be patient. But I have no such thing as a slow
path of local allocation (well, there is, but it is extremely rare), so
probably the answer is "no".
> 3. I claim use of a single interlocked CAS instruction and a #StoreStore
> memory barrier in the slow-path of a remote deallocation.
At most one CAS in this case.
> 3a. I claim the interlocked Single CAS in claim 3 will operate on
> memory locations that are the size of a system word.
It is not a claim, but a property of the underlying system.
> 4. I claim use of a single interlocked SWAP instruction and a #LoadStore
> memory barrier at thread termination.
No, then my allocator follows a lock-based path (which implies
a full barrier), because there is no need to optimize this case.
Best regards
Piotr Wyderski
> Yes it does; please note that current x86 has a #LoadStore | #StoreStore
> barrier implied for stores, and #LoadStore | #LoadLoad implied for loads.
Yes, I know this, but my code must run on many other architectures
(SPARC, POWER, PA-RISC, there is a rumor about the Itanium
family), so I must specify the desired consistency semantics of CAS
manually (via a template parameter, it's C++ with a bit of assembly
language). The resulting code contains no barrier instruction on the
IA-32/EM64T family, indeed. But logically there is a barrier. For the
same reason I have consistency-augmented versions of load() and
store(), which do not even require atomic instructions on IA-32.
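For illustration, such consistency-augmented wrappers might look roughly like
this in modern C++ (just a guess at the shape, not the actual library code):
-----------
#include <atomic>

template <std::memory_order Order, class T>
T ordered_load(const std::atomic<T>& a) {
    return a.load(Order);              // a plain MOV on IA-32 for acquire
}

template <std::memory_order Order, class T>
void ordered_store(std::atomic<T>& a, T v) {
    a.store(v, Order);                 // a plain MOV on IA-32 for release
}

// usage:
//   ordered_store<std::memory_order_release>(ready_flag, 1);
//   int r = ordered_load<std::memory_order_acquire>(ready_flag);
-----------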
> Notice how I say/comment that the s/lfence instructions might be needed on
> future Intel CPUs? If you want this on SPARC in non-TSO mode, well,
> you need a #StoreStore.
Surely, been there... :-)
> If you're on Alpha, then you need a membar for the producer AND a membar for
> the consumer.
Fortunately Alpha is dead, its barrier on dependent
loads was a serious pain in the, erm, call it neck...
> How is my allocator invention, eventcount or my queue not low-overhead?
Give me a couple of days. :-)
Best regards
Piotr Wyderski
> No, industrial espionage, you know. :-) Chris, we are talking
> about a marvel that gives our respective companies serious
> performance advantages. I can swear to you that I will not discuss
> its guts in public. :-)
As I understand it, you don't want to reveal any details of the
algorithms and products you are working on. But can you please answer
2 harmless questions? ;) I am just interested in what fields such
techniques are relevant, and in what fields companies are interested in
such specialists.
What software are you producing? I suppose that this is server
systems :) Is it telecom or VoIP or what?
On what hardware is the software running? Is it commodity SMP/multicore
x86 or "big" servers (Sun or IBM)?
Dmitriy V'jukov
> I am just interested in what fields such techniques are relevant.
Whenever you want to process a tremendous amount of
data as fast as possible using a big multiprocessor machine.
> And in what fields companies are interested in such specialists.
In my case its the financial/banking sector.
> What software are you producing?
System integrators, business model execution frameworks,
dataflow processing, various virtual machines etc.
> I suppose that this is server systems :)
Yes, you may call it so.
> On what hardware is the software running?
On almost everything that has ever been invented, frankly. :-)
Name a big enough "serious" processor family and it will
be a bull's eye hit.
> Is it commodity SMP/multicore x86 or "big" servers (Sun or IBM)?
All of them (you know, it is much cheaper to build a cluster of commodity
machines and it works better than a mainframe), but the "big iron" segment
is our primary target.
Best regards
Piotr Wyderski
Everyone wants to process data as fast as possible... but it seems that
nobody hires such specialists :)
Ok. Thank you.
Dmitriy V'jukov
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/f2c94118046142e8
:^0
Okay. Well, perhaps we can have a private conversation when we get some free
time. Should we both sign an NDA?
>> Yes, the performance is extreme indeed!
>
> I have done a lot to make its fast path to be really FAST. :-)
>
>> _____________________________
>> 1. I claim use of no interlocked rmw instruction and/or memory
>> barrier in the fast-path of local allocations or deallocations.
>
> Agreed.
Yup. This is totally equal to single-thread performance in this case.
>
>> 2. I claim use of a single interlocked SWAP instruction and no memory
>> barrier for the slow-path of a local allocation.
>
> I don't understand it, because it requires the knowledge of your
> algorithm, so please be patient. But I have no such thing as a slow
> path of local allocation (well, there is, but it is extremely rare), so
> probably the answer is "no".
The local-allocation slow-path is hit when the calling thread's allocator
exhausts its memory buckets.
>> 3. I claim use of a single interlocked CAS instruction and a #StoreStore
>> memory barrier in the slow-path of a remote deallocation.
>
> At most one CAS in this case.
Yup.
>> 3a. I claim the interlocked Single CAS in claim 3 will operate on
>> memory locations that are the size of a system word.
>
> It is not a claim, but a property of the underlying system.
Agreed.
>> 4. I claim use of a single interlocked SWAP instruction and a #LoadStore
>> memory barrier at thread termination.
>
> No, then my allocator follows a lock-based path (which implies
> a full barrier), because there is no need to optimize this case.
I am interested in how you integrate the locking into the mix; perhaps I
will send you an e-mail. My allocator usually runs on a method of using a
thread's actual stack as the memory pool, and I do use POSIX condition
variables in the commercial version of the code. The locking comes after
claim 4.
___PS___
Well, at least I am sure that we developed the algorithms independently
without any exchange of information beforehand. So, we have invented some
of the best-performing memory allocators "currently" out there! :^)
How could a _strictly conforming_ malloc/free, like my allocator and I am
assuming yours as well, be a bad thing? The underlying implementation of
malloc/free should be completely opaque and of no concern to any standard
conforming application. Humm, you mean it blew something apart with prior
race-condition(s) being exposed due to the increased parallelism? I have
dealt with this issue on some consulting work before; something like this:
http://groups.google.com/group/comp.programming.threads/msg/9a5500d831dd2ec7
Good times! ;^) As for the queue implementation, I can't quite see how you
design the queue with 1-CAS per-op. You must be packing the head-and-tail
into a single word or something. Are you using a bounded queue?
Yes. My experimental AppCore library, at this time, is for the x86 only. My
commercial library, vZOOM, currently runs on 32/64-bit SPARC, 32/64-bit
POWER, IA-32/64 and the ARM9. I also have a _highly experimental_ version
for the Cell BE. Programming lock-free stuff for the Cell is interesting to
say the least, stuff like LL/SC augmented with DMA... Doing the ARM9 version
was pretty interesting as well, the only atomic operation was SWP. I had
somebody interested in licensing the library for use in a tightly-packed
embedded environment running Quadros. I had to simulate a heap using the
stacks of Quadros "Tasks"... Luckily for me, the allocator runs on thread
stacks anyway.
> (SPARC, POWER, PA-RISC, there is a rumor about the Itanium
> family), so I must specify the desired consistency semantics of CAS
> manually (via a template parameter, it's C++ with a bit of assembly
> language). The resulting code contains no barrier instruction on the
> IA-32/EM64T family, indeed. But logically there is a barrier. For the
> same reason I have consistency-augmented versions of load() and
> store(), which do not even require atomic instructions on IA-32.
Yeah. A coherent abstraction over atomic operations can work out nicely.
>> Notice how I say/comment that the s/lfence instructions might be needed on
>> future Intel CPUs? If you want this on SPARC in non-TSO mode, well,
>> you need a #StoreStore.
>
> Surely, been there... :-)
:^)
>> If you're on Alpha, then you need a membar for the producer AND a membar for
>> the consumer.
>
> Fortunately Alpha is dead, its barrier on dependent
> loads was a serious pain in the, erm, call it neck...
No shi%. IIRC, you had to use a full-blown MB instruction.
>> How is my allocator invention, eventcount or my queue not low-overhead?
>
> Give me a couple of days. :-)
Okay. ;^)
> Well, perhaps we can have a private conversation when we get some free
> time.
Sounds great. :-)
> Should we both sign an NDA?
If you wish, but a gentlemen's agreement would be fine with me.
> Yup. This is totally equal to single-thread performance in this case.
And this kind of performance can also be boosted.
> The local-allocation slow-path is hit when the calling thread's allocator
> exhausts its memory buckets.
Well, there finally is a lock in the underlying malloc(),
when the buckets are all empty and you need to refill
them -- and you can do nothing about that.
> Well, at least I am sure that we developed the algorithms independently
> without any exchange of information beforehand.
Yes, I am totally convinced about the independence of our efforts
to design those memory allocators. It was so independent that I still
don't even know whether they are similar or not. :o)
Best regards
Piotr Wyderski
> > Yup. This is totally equal to single-thread performance in this case.
>
> And this kind of performance can also be boosted.
What do you mean? That the single-threaded allocation algorithm can be
optimized? Or that the synchronization algorithm can be optimized? But how can
a multithreaded allocator be faster than a single-threaded one?
Dmitriy V'jukov
> It limits the capacity of your inter-CPU communication channel to
> a mere ten million FIFO transactions per second (when you are
> lucky). I can get hundreds of millions when "cheating", that is, the
> processor serves its own requests, so it is not a true FIFO queue.
> My tasks are fine-grained, so it makes a big difference.
I suppose that you use something like the ABP work-stealing deque.
Something like this:
http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf
Am I right?
If so, I am interested in how far you optimized it? It seems that
you use only highly optimized solutions ;)
How many CASes or expensive memory barriers do you have in the pop()/
push() and steal() functions?
Dmitriy V'jukov
> What do you mean? That the single-threaded allocation algorithm can be
> optimized?
Yes.
Best regards
Piotr Wyderski
I don't think you missed anything...
:^0
[...]
You can keep the queue and the memory manager completely separated and
_still_ use, as Dmitriy calls it, the 0.5-CAS-per-operation fifo technique.
:^)
[...]
The index trick is not that ugly:
http://groups.google.com/group/comp.arch/msg/7737753cc958dafb
:^)
> > Btw, AMD claims that their multicore chips are a little bit NUMA. The
> > distance between all the cores and the memory bank is the same, but the
> > interface to the memory bank is 2-way, and the first core has priority
> > on the first way while the second core has priority on the second way.
> > As I understand it.
>
> Interesting. I think they are doing something that involves sticking
> multiple CPU's, GPU's and their various dedicated co-processors on a single
> chip.
See this article for details
THE AMD OPTERON NORTHBRIDGE ARCHITECTURE:
http://www.computer.org/portal/cms_docs_micro/micro/homepage/2007/m2010.pdf
Dmitriy V'jukov
Coherent HyperTransport... Scalable DMA?
There's some stuff here
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf
They're presuming the OS scheduler is ccNUMA-aware and does the right thing
as far as the application is concerned, which I think means non-ccNUMA-aware
applications only get scheduled on one chip. Otherwise you'd get
non-ccNUMA-aware applications maxing out cache-coherency traffic on the
HyperTransport links.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
Oh my... I think you're dead on the money with your clever observations, Joe!