
C++ multithreading: mem_pool


Sergey P. Derevyago

Aug 6, 2008, 11:31:44 AM
* I've written a C++ multithreading tutorial. Here it is if you can read
Russian: http://ders.stml.net/cpp/mtprog/mtprog.html.
* The source code is http://ders.stml.net/cpp/mtprog/code.zip and it has
English comments.
* The Doxygen documentation is http://ders.stml.net/cpp/mtprog/doc/index.html

I have no time to create a full English translation, so I'm going to
describe only a few important issues.

First of all: use thread-private memory allocators wherever possible,
because the typical C++ operators new/delete ARE TERRIBLY SLOW! New/delete
don't allow for concurrent or simultaneous execution, so they should not
be blindly used in well-designed MT programs!

Please take a look at example1/main.cpp living at
http://ders.stml.net/cpp/mtprog/mtprog.html#3.1
-----------------------------------8<-----------------------------------
void start_std(void*)
{
   list<int> lst;

   for (int i=0; i<N; i++) {
      for (int j=0; j<M; j++) lst.push_back(j);
      for (int j=0; j<M; j++) lst.pop_front();
   }
}

void start_ders(void*)
{
   mem_pool mp;
   stl_allocator<int> alloc(mp);

   list<int, stl_allocator<int> > lst(alloc);

   for (int i=0; i<N; i++) {
      for (int j=0; j<M; j++) lst.push_back(j);
      for (int j=0; j<M; j++) lst.pop_front();
   }
}
-----------------------------------8<-----------------------------------
The table at http://ders.stml.net/cpp/mtprog/mtprog.html#3.1.1 shows
that the start_ders() function is tens and even hundreds (!!!) of times faster:
1 CPU : 38.1
1 CPU,HT : 42.1
2 CPU,SMP: 321.6 (!!!)
2 CPU,HT : 37.0

Please don't say IMPOSSIBLE! ;) Just download code.zip and try
mtprog\examples\example1\main.cpp yourself, with your own compiler and box.
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net

Dmitriy V'jukov

Aug 6, 2008, 12:51:51 PM
On Aug 6, 7:31 pm, "Sergey P. Derevyago" <non-exist...@iobox.com>
wrote:

> * I've written a C++ multithreading tutorial. Here it is if you can read
> Russian: http://ders.stml.net/cpp/mtprog/mtprog.html.
> * The source code is http://ders.stml.net/cpp/mtprog/code.zip and it has
> English comments.
> * Doxygen is http://ders.stml.net/cpp/mtprog/doc/index.html


You conclude that a Hyper-Threaded processor is worse than a non-Hyper-
Threaded one. Have you applied the optimization techniques described
in the "Intel 64 and IA-32 Architectures Optimization Reference Manual"?
Most notably:
- Eliminate 64-KByte Aliased Data Accesses
- Preventing Excessive Evictions in First-Level Data Cache
- Using Shared Execution Resources in a Processor Core
I think those can have a great impact.


Dmitriy V'jukov

Sergey P. Derevyago

Aug 6, 2008, 1:08:27 PM
Dmitriy V'jukov wrote:
> You conclude that Hyper-Threaded processor is worse than non Hyper-
> Threaded processor.
>
Not exactly.
From the _scalability_ POV, one HT CPU is worse than one non-HT CPU for
_this kind of task._

> Have you applied optimization techniques described
> in "Intel 64 and IA-32 Architectures Optimization Reference Manual"?
>

There was no point in explicitly coding for HT processors.
HT quirks are out of scope.
Nevertheless, please feel free to apply any optimizations you want and
report the results...

Chris M. Thomasson

Aug 6, 2008, 9:22:13 PM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:4899c3e1$0$90276$1472...@news.sunsite.dk...

>* I've written a C++ multithreading tutorial. Here it is if you can read
>Russian: http://ders.stml.net/cpp/mtprog/mtprog.html.
> * The source code is http://ders.stml.net/cpp/mtprog/code.zip and it has
> English comments.
> * Doxygen is http://ders.stml.net/cpp/mtprog/doc/index.html
>
> I have no time to create a full English translation so I'm going to
> describe only several important issues.
>
> First of all: Use thread-private memory allocators wherever possible
> because typical C++ operators new/delete ARE TERRIBLY SLOW! New/delete
> don't allow for concurrent or simultaneous execution so they should not be
> blindly used in well designed MT programs!

100% totally agreed. However, new/delete can be hooked up to a memory
allocator that is based on nothing but per-thread heaps. Indeed we have
discussed this before:

http://groups.google.com/group/comp.programming.threads/msg/bc019dc04362be41

http://groups.google.com/group/comp.programming.threads/msg/1d3aeaa6ebf5181f

http://groups.google.com/group/comp.programming.threads/msg/907573c0e81dba56

http://groups.google.com/group/comp.programming.threads/msg/f5ea5a977a156a03
(I finally got you to agree that my design makes sense... ;^)

http://groups.google.com/group/comp.arch/browse_frm/thread/24c40d42a04ee855

Anyway, your design makes sense. There is one question... Can one thread
allocate a memory block and pass it to another thread which ultimately frees
it? If not, then how are you going to handle producer/consumer?

[...]

Sergey P. Derevyago

Aug 7, 2008, 4:28:50 AM
Chris M. Thomasson wrote:
> Can one thread
> allocate memory block, pass it to another which ultimately frees it? If
> not, then how are you going to handle producer/consumer?
>
1. A thread-private allocator
http://ders.stml.net/cpp/mtprog/doc/classders_1_1mem__pool.html means
that its memory blocks don't cross thread boundaries. I.e., you can't
get a block from Alloc1 in Thr1 and free it in Thr2.
2. The common solution is serialization through synchronized queues
http://ders.stml.net/cpp/mtprog/doc/classders_1_1data__queue.html :
serialize your data in Thr1 and reconstruct it back in Thr2 using its
Alloc2.
3. Yes, I understand that there can be really big data chunks to pass
between threads. The obvious solution is to use the usual new/delete to
allocate this big data and pass a pointer rather than serializing the
whole memory block.

BTW, I've written several real-world applications using derslib. Just
download http://ders.stml.net/cpp/mtprog/code.zip and look at (sorted by
complexity):
mtprog\src\mtprog\mtftext : just a tutorial sample of trivial grep
mtprog\src\mtprog\mtcksrc : checks source code files
mtprog\src\mtprog\mtdel : deletes files by mask[s]
mtprog\src\mtprog\mtcnvsrc : converts source code files
mtprog\src\mtprog\mtdirdiff: compares directories

The sources are well-written, so you'll have no problem seeing what's
going on.

Chris M. Thomasson

Aug 7, 2008, 6:30:54 AM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:489ab242$0$90264$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> Can one thread allocate memory block, pass it to another which ultimately
>> frees it? If not, then how are you going to handle producer/consumer?
>>
> 1. Thread-private allocator
> http://ders.stml.net/cpp/mtprog/doc/classders_1_1mem__pool.html means that
> its memory blocks doesn't cross thread boundaries. I.e. you can't get a
> block from Alloc1 in Thr1 and free it in Thr2.
> 2. The common solution is serialization through synchronized queues
> http://ders.stml.net/cpp/mtprog/doc/classders_1_1data__queue.html :
> serialize your data in Thr1 and reconstruct it back in Thr2 using its
> Alloc2.

Okay. Can I use my own queue algorithm to create an efficient
producer/consumer pattern using your mem-pool? Something like:

thread-1
-----------------------------------
void* mem = your_malloc(...);
my_queue_push(..., mem);


thread-2
-----------------------------------
void* mem = my_queue_pop_wait(...);
your_free(mem);

?

P.S.

Try not to flame me if I am wrong. I have not downloaded your source code
yet; I have limited time right now.

;^(...

[...]

Sergey P. Derevyago

Aug 7, 2008, 9:36:19 AM
Chris M. Thomasson wrote:
> Okay. Can I use my own queue algorithm to create an efficient
> producer/consumer pattern using your mem-pool? Something like:
>
> thread-1
> -----------------------------------
> void* mem = your_malloc(...);
> my_queue_push(..., mem);
>
>
> thread-2
> -----------------------------------
> void* mem = my_queue_pop_wait(...);
> your_free(mem);
>
> ?
>
No, you can't.
The point is that there is no your_malloc() ;)

Look:
1. I have a class: ders::mem_pool
2. You can create an instance: mem_pool mp;
3. And use it to get some memory: void* ptr=mp.allocate(123);

Class mem_pool is _intentionally_ non-synchronized; it is supposed to be
used as a thread-private mem_pool!

Well, you may say that the ubiquitous mem_pool& mp references are really
boring. Yes, you are right: thread-private mem_pools have their price.
And the payoff is: your applications will run tens and even hundreds of
times faster!

Chris M. Thomasson

Aug 7, 2008, 9:24:13 PM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:489afa54$0$90272$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> Okay. Can I use my own queue algorithm to create an efficient
>> producer/consumer pattern using your mem-pool? Something like:
>>
>> thread-1
>> -----------------------------------
>> void* mem = your_malloc(...);
>> my_queue_push(..., mem);
>>
>>
>> thread-2
>> -----------------------------------
>> void* mem = my_queue_pop_wait(...);
>> your_free(mem);
>>
>> ?
>>
> No, you can't.
> The point is that there is no your_malloc() ;)
>
> Look:
> 1. I have a class: ders::mem_pool
> 2. You can create an instance: mem_pool mp;
> 3. And use it to get some memory: void* ptr=mp.allocate(123);
>
> Class mem_pool is _intentionally_ non-synchronized; it is supposed to be
> used as a thread-private mem_pool!

Fine. But, as we have discussed before, there is an algorithm in which
memory from private thread pool A can be freed into private thread pool B
while preserving all the performance benefits of local unsynchronized
allocation. Basically, there is no synchronization for thread-local
allocations and deallocations; remote deallocations, however, use a
lock-free operation.


> Well, you can say that ubiquitous mem_pool& mp references are really
> boring. Yes, you are right: thread-private mem_pools have their price. And
> the price is: your applications will run tens and even hundreds of times
> faster!

I was thinking along the lines of a general-purpose allocator (e.g.,
malloc/free) which is completely based on thread-private memory pools.

Chris M. Thomasson

Aug 7, 2008, 9:36:11 PM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:489ab242$0$90264$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> Can one thread allocate memory block, pass it to another which ultimately
>> frees it? If not, then how are you going to handle producer/consumer?
>>
> 1. A thread-private allocator
> http://ders.stml.net/cpp/mtprog/doc/classders_1_1mem__pool.html means that
> its memory blocks don't cross thread boundaries. I.e., you can't get a
> block from Alloc1 in Thr1 and free it in Thr2.
> 2. The common solution is serialization through synchronized queues
> http://ders.stml.net/cpp/mtprog/doc/classders_1_1data__queue.html :
> serialize your data in Thr1 and reconstruct it back in Thr2 using its
> Alloc2.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

IMHO, the serialization has "unnecessary" overhead when compared to simply
passing a pointer through a queue. BTW, just to clear things up: if thread A
passes object1 over to thread B using your scheme, thread B would not
contain the actual object1, but a brand-new copy of it instead, right? How do
you handle scenarios in which a _single_ object needs to be maintained
between a producer and consumer? For instance, how do I pass objects between
threads which cannot be copied? E.g., perhaps an object has a private copy
constructor or something along those lines...


> 3. Yes, I understand that there can be really big data chunks to pass
> between threads. The obvious solution is to use the usual new/delete to
> allocate this big data and pass a pointer rather than serializing the
> whole memory block.
>
> BTW I've written several real-world application using derslib. Just
> download http://ders.stml.net/cpp/mtprog/code.zip and look at (sorted by
> complexity)
> mtprog\src\mtprog\mtftext : just a tutorial sample of trivial grep
> mtprog\src\mtprog\mtcksrc : checks source code files
> mtprog\src\mtprog\mtdel : deletes files by mask[s]
> mtprog\src\mtprog\mtcnvsrc : converts source code files
> mtprog\src\mtprog\mtdirdiff: compares directories


> The sources are well-written so you'll get no problems to see what's going
> on.

Agreed. Unfortunately, I have only had time to briefly skim over the
code...

;^(

Sergey P. Derevyago

Aug 8, 2008, 2:47:06 AM
Chris M. Thomasson wrote:
> Fine. But, as we have discussed before, there is an algorithm in which
> memory from private thread pool A can be freed on private thread pool B
> while preserving all the performance benefits of local unsynchronized
> allocation. Basically, there is no synchronization for thread local
> allocations and deallocations. However, for remote deallocations there
> is lock-free operation.
>
Lock-free doesn't ALWAYS mean fast, you know.
Could you please give a direct link to "preserving all the performance
benefits of local unsynchronized allocation"? I have some doubts...

Sergey P. Derevyago

Aug 8, 2008, 3:00:23 AM
Chris M. Thomasson wrote:
> IMHO, the serialization has "unnecessary" overhead when compared to
> simply passing a pointer through a queue.
>
As a rule, having to use _synchronized_ mem_pools is order(s) of
magnitude slower.
Also, there is no need to pass (i.e. serialize/restore) a message to
another worker thread if it can be processed "in place" faster.

> BTW, just to clear things up,
> if thread A passes object1 over to thread B, using your scheme, thread B
> would not contain the actual object1, but a brand new copy of it instead
> right?
>

Sure.

> How do you handle scenarios in which a _single_ object needs to
> be maintained between a producer and consumer? For instance, how do I
> pass objects between threads which cannot be copied? e.g, perhaps it has
> a private copy constructor or something along those lines...
>

In essence, your statement is "but if I need X then Y..."
But why do you need X? Could you please show me some _real-world_
example where you have to pass non-copyable objects between threads?

Dmitriy V'jukov

Aug 8, 2008, 3:41:44 AM
On Aug 8, 11:00 am, "Sergey P. Derevyago" <non-exist...@iobox.com> wrote:
> > IMHO, the serialization has "unnecessary" overhead when compared to
> > simply passing a pointer through a queue.
>
> As a rule, the need to use _synchronized_ mem_pool-s is order(s) of
> magnitude slower.

For local operations, a synchronized mempool has precisely the same
overhead as a single-threaded mempool. Well, OK, a SMART synchronized
mempool does.

> Also there is no need to pass (i.e. serialize/restore) some message to
> another worker thread if it can be processed "in place" faster.

But consider the following situation:
serialize_restore_overhead == processing_time
time_to_pass_object_by_pointer << processing_time

Dmitriy V'jukov

Sergey P. Derevyago

Aug 8, 2008, 5:08:04 AM
Dmitriy V'jukov wrote:
> On local operations synchronized mempool has precisely the same
> overheads as single-threaded mempool. Well, Ok, SMART synchronized
> mempool.
>
Impossible.
This time I don't want to cite the theoretical reasons why synchronization
is ALWAYS worse than no synchronization.

I'd rather suggest the following experiment: please take my
mtprog\examples\example1\main.cpp example and plug in your "SMART
synchronized mempool" to see the difference...

Dmitriy V'jukov

Aug 8, 2008, 6:00:37 AM
On Aug 8, 1:08 pm, "Sergey P. Derevyago" <non-exist...@iobox.com> wrote:
> Dmitriy V'jukov wrote:
> > On local operations synchronized mempool has precisely the same
> > overheads as single-threaded mempool. Well, Ok, SMART synchronized
> > mempool.
>
> Impossible.
> This time I don't want to cite theoretical stuff why synchronization is
> ALWAYS worse than no synchronization.
>
> I'd better suggest the following experiment: please take my
> mtprog\examples\example1\main.cpp example and plug your "SMART
> synchronized mempool" to see the difference...

There will be no difference, because the alloc() method will be the same,
except for one additional 'if' on the *slow* path; the free() method will be
exactly the same; plus there is one additional method, remote_free(). That is
all. Exactly the same performance as your single-threaded mempool.

Dmitriy V'jukov

Sergey P. Derevyago

Aug 8, 2008, 6:12:40 AM
Dmitriy V'jukov wrote:
> Exactly the same performance as your single-threaded mempool.
>
Nevertheless, please perform _real_ measurements.
"In theory, there is no difference between practice and theory. In
practice, there is."

Dmitriy V'jukov

Aug 8, 2008, 6:57:55 AM
On Aug 8, 2:12 pm, "Sergey P. Derevyago" <non-exist...@iobox.com> wrote:
> > Exactly the same performance as your single-threaded mempool.
>
> Nevertheless, please perform _real_ measurements.
> "In theory, there is no difference between practice and theory. In
> practice, there is."

I don't want to waste my time... at least not this way. Please, can you
give some hint as to where I can expect performance degradation? The
only thing I see is one additional 'if' on the *slow path* in alloc()...

Dmitriy V'jukov

Sergey P. Derevyago

Aug 8, 2008, 7:29:21 AM
Dmitriy V'jukov wrote:
> I don't want to waste my time... at least this way.
>
So please don't make unverified statements!

> Please, can you
> make some hint about where I can expect performance degradation? The
> only thing I see is one additional 'if' on *slow-path* in alloc()...
>

1. I published my code and the performance/scalability results.
2. From you there is no code, only claims.

So what do you expect?! Please don't make questionable statements!

Giancarlo Niccolai

Aug 8, 2008, 10:01:02 AM
Sergey P. Derevyago wrote:

The local allocation/memory pool/garbage collector model, with object
serialization to pass objects between threads, is the main model I used
to write the multithreading module for the Falcon programming language.
I also provide some objects that need no memory-pool management and can
share memory and simple data (i.e. numbers, fixed-size strings, etc.)
directly between threads.

Unless there is a need to share very complex objects (and usually it
is possible to ask the thread owning the object to update it via
efficient remote-thread calls), the higher efficiency and performance of
the memory allocation and garbage management seem to counterbalance the
higher complexity of data sharing. Of course, some burden is put on the
language user, who needs to know that objects are NOT shared (at best
they are copied) and must work around this limit.

However, for a program written from the ground up and addressing
real-world problems, it seems not to add too much complexity, and it
enforces an agent-based MT programming model, where every agent is
responsible for keeping its own data and sharing INFORMATION about its
STATE rather than the data itself, which helps to keep a cleaner MT
environment. I have written and seen many C/C++ programs that sooner or
later ended up pursuing this model and these goals as the complexity of
the MT interactions grew.

OK, long words to say that there are full _real-world_ examples in which
you don't need to copy or share much between threads, so I lean towards
Sergey.

---

OTOH, I've been loosely working on a shared-pool model where data is
created in a thread-specific pool but can be sent around to any other
thread in the program, and then re-collected, after every other thread
is done with it, directly by the pool (and thread) where the object was
created.

As soon as I have a bit of free time from our next release I will try to
dig deeper into that, so as to have the best of both worlds and be able
to share big, complex objects directly across threads too. If someone is
interested, I'd be pleased to go into the details; and we always need
hands, so if someone wishes to give this a try, starting from an already
working base, he/she is welcome...

Bests,
Giancarlo Niccolai.

Sergey P. Derevyago

Aug 8, 2008, 10:26:24 AM
Giancarlo Niccolai wrote:
> every agent is responsible for keeping its own data and sharing
> INFORMATION about its STATE, rather than the data itself, which helps
> to keep a cleaner MT environment
>
Concise and clear!
I see we have (almost) the same vision, thank you.

Chris M. Thomasson

Aug 8, 2008, 3:03:34 PM
"Giancarlo Niccolai" <gc-re...@removeme-niccolai.cc> wrote in message
news:489c519f$0$11385$5fc...@news.tiscali.it...
[...]

>
> OTOH, I've been loosely working to a shared pool model where data is
> created in a thread-specific pool, but can be sent around to any other
> thread in the program, and then re-collected after each other thread is
> done, directly by the pool (and thread) where the object was created.
>
> As soon as I have a bit of free time from our next release I will try to
> do go deep in that, so to have the best of both worlds and being able to
> share also big, complex objects directly across threads. If someone is
> interested, I'd be pleased to enter the details; and we always need
> hands, so if someone wish to give a try at this starting from an already
> working base, he/she is welcome...

Here is pseudo-code for a memory allocator that is based on 100%
thread-private pools; however, memory can be freed by any thread:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69

Here is some crude but fully compilable code for a general-purpose region
allocator that is based on a similar algorithm:

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html

It uses some x86 assembly which compiles under MSVC++. If you study the code
you will see that local allocations and deallocations are synchronization
free. Remote deallocations use a lock-free operation, and attempts to
allocate from an exhausted local pool use a wait-free operation.

Chris M. Thomasson

Aug 8, 2008, 3:11:22 PM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:489c0cf5$0$90269$1472...@news.sunsite.dk...

> Dmitriy V'jukov wrote:
>> On local operations synchronized mempool has precisely the same
>> overheads as single-threaded mempool. Well, Ok, SMART synchronized
>> mempool.
>>
> Impossible.

It is possible. I have done it. Others have as well. BTW, did you know that
Hoard also uses thread-private pools? I would suggest doing some research
into state-of-the-art memory allocation algorithms. Have you ever heard of
StreamFlow? Read this paper:

http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf

Their allocator is just as fast as yours, except it's a heck of a lot more
flexible because memory can be freed into it from any thread. Keep in mind
that local allocations and deallocations are 100% synchronization free.
Unlike with yours, you can actually use a producer/consumer pattern which
does not require the overhead of serialization.


> This time I don't want to cite theoretical stuff why synchronization is
> ALWAYS worse than no synchronization.

You seem to be forgetting the conversation we had a couple of years ago. You
agreed that my allocator design makes sense:

http://groups.google.com/group/comp.programming.threads/msg/f5ea5a977a156a03

You already know that it's not impossible to create a thread-private
allocator in which memory can be freed into any thread.


> I'd better suggest the following experiment: please take my
> mtprog\examples\example1\main.cpp example and plug your "SMART
> synchronized mempool" to see the difference...

You want to compare one thread-private allocator to another. What's that
going to prove?

Chris M. Thomasson

Aug 8, 2008, 3:21:19 PM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:489bef07$0$90276$1472...@news.sunsite.dk...

The point is that I cannot use your system with objects that have a private
copy-ctor in a producer/consumer pattern.

Let's say that I have a server connection object that has buffers, a socket,
and pointers to other active connections. When data comes in on the
connection's socket, I want to pass a pointer to the object to a pool of
consumer threads which act on the received data. If I use your system, each
time data comes in and I need to hand a pointer to the object off to a
consumer, the whole connection object will be serialized into another copy.
So, before you know it, there are multiple copies of the connection object
floating around. I only want ONE connection object per socket, therefore its
copy-ctor and assignment operator would be private, which means I cannot
use your allocator.

I don't see how you could do an efficient producer/consumer pattern if every
production results in a full-blown copy.

Chris M. Thomasson

Aug 8, 2008, 3:27:51 PM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:489c2e12$0$90272$1472...@news.sunsite.dk...

> Dmitriy V'jukov wrote:
>> I don't want to waste my time... at least this way.
> >
> So please don't make unverified statements!
>
>> Please, can you
>> make some hint about where I can expect performance degradation? The
>> only thing I see is one additional 'if' on *slow-path* in alloc()...
>>
> 1. I published my code and the performance/scalability results.
> 2. There is no your code, the claims only.
>
> So what do you expect?! Please don't make questionable statements!

You don't seem to get it. The "SMART" memory-pool allocator has the exact
same performance as yours, except it's general-purpose (e.g., it can be used
to implement malloc/free) and far more flexible. For instance, you can check
the performance results of the StreamFlow allocator:

http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf

Also, you can test one of my multi-threaded region allocator algorithms:

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html

This one is general-purpose such that it can be used to implement the global
new/delete operators. If you step through the code you will quickly notice
that there is absolutely NO synchronization in local allocations or local
deallocations. The only time any sync is used is when a thread frees memory
that it did not allocate, and when a local allocation hits an exhausted
thread-private pool.

Chris M. Thomasson

Aug 8, 2008, 8:08:39 PM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:489bebeb$0$90265$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> Fine. But, as we have discussed before, there is an algorithm in which
>> memory from private thread pool A can be freed on private thread pool B
>> while preserving all the performance benefits of local unsynchronized
>> allocation. Basically, there is no synchronization for thread local
>> allocations and deallocations. However, for remote deallocations there is
>> lock-free operation.
>>
> Lock-free doesn't ALWAYS mean fast, you know.

Sure.


> Could you please give a direct link to "preserving all the performance
> benefits of local unsynchronized allocation"? I have some doubts...

Since I don't think you trust me or Dmitriy:

http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf

Please don't say that smart mem pools are impossible. You just end up making
it seem like you don't have any idea of how state-of-the-art memory
allocators actually work!

Chris M. Thomasson

Aug 8, 2008, 8:11:41 PM

"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:489bebeb$0$90265$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> Fine. But, as we have discussed before, there is an algorithm in which
>> memory from private thread pool A can be freed on private thread pool B
>> while preserving all the performance benefits of local unsynchronized
>> allocation. Basically, there is no synchronization for thread local
>> allocations and deallocations. However, for remote deallocations there is
>> lock-free operation.
>>
> Lock-free doesn't ALWAYS mean fast, you know.
> Could you please give a direct link to "preserving all the performance
> benefits of local unsynchronized allocation"? I have some doubts...

http://groups.google.com/group/comp.programming/msg/bb3c14740615e266

Giancarlo Niccolai

Aug 9, 2008, 9:56:56 AM
Chris M. Thomasson wrote:

I solved this problem through an object implementing a shared-memory
abstraction.

The thread receiving the data creates a thread-local instance of a
"shared memory" object, which is more or less a reference counter plus a
pointer to the shared memory, which is separately allocated in the
global heap.

The receiver thread then fills the shared memory area with the incoming
data. When the object (counter + pointer) is "serialized" and sent to
another thread, the counter is atomically incremented; when it is
"deserialized" and a local instance of the shared-memory object is
created, nothing is done. When a local GC frees an instance, the counter
is decremented. When the counter hits 0, the shared memory is freed from
the heap. (Needless to say, the counter is part of the heap/shared
memory.)

In short, only the "shell" of the data needs to be copied across
threads, while the "core" of the data can be shared.

I even have a facility in the serialization code which detects whether a
serialization is "live" or not. A live serialization is meant to be
short-lived and performed between threads of the same process, so there
is no need to e.g. unroll pointers into streams. It is highly optimized
(it resolves more or less into a flat memcpy of the data).

Indeed, there is an overhead even in this case, as sending a pointer
around is better than sending a copy of a small structure around, and
you need a bit of extra memory, but the simplicity of the allocation and
of the GC process counterbalances this cost.

GN.

Giancarlo Niccolai

Aug 9, 2008, 9:57:32 AM
Chris M. Thomasson wrote:
> "Giancarlo Niccolai" <gc-re...@removeme-niccolai.cc> wrote in message
> news:489c519f$0$11385$5fc...@news.tiscali.it...
> [...]
>
> Here is pseudo-code for a memory-allocator that is based on 100%
> thread-private pools, however, memory can be freed by any thread:
>
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69
>
> Here some crude, but fully compliable code for a general-purpose region
> allocator that is based on a similar algorithm:
>
> http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html

Thanks! I will dig into that.

Gian.

Sergey P. Derevyago

Aug 11, 2008, 4:35:39 AM
Chris M. Thomasson wrote:
> http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf
>
I need some time to review this paper. Please wait...

> Please don't say that smart mem pools are impossible.
>

I'm just asking you to plug a _real_ "smart mem pool" into my example1
and show the results.
Is that impossible? Why?

> You just end up
> making it seem like you don't have any idea on how state-of-the-art
> memory allocators actually work!
>

Does ismm06.pdf contain the necessary description?
Is there something else to review?

Sergey P. Derevyago

Aug 11, 2008, 4:38:42 AM
Chris M. Thomasson wrote:
> http://groups.google.com/group/comp.programming/msg/bb3c14740615e266

Could you please plug your algorithm into my example1 and show the results?

Sergey P. Derevyago

Aug 11, 2008, 4:51:01 AM
Chris M. Thomasson wrote:
> You don't seem to get it. The "SMART" memory pool allocator has the
> exact same performance as yours.
>
Have you really tested it?

> http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html
>
I only skimmed the code, and I see that under some circumstances your
allocate_local() calls allocate_remote(sz), which imposes synchronization
overhead.
So the question remains: have you _really_ tested it?

Sergey P. Derevyago

Aug 11, 2008, 4:58:32 AM
Chris M. Thomasson wrote:
> You seem to be forgetting the conversation we had a couple of years ago.
> You agreed that my allocator design makes sense:
>
> http://groups.google.com/group/comp.programming.threads/msg/f5ea5a977a156a03
>
>
> You already know that its not impossible to create a thread-private
> allocator in which memory can be freed into any thread.
>
I need a timeout :)
I'm going to review ismm06.pdf and return to this question, OK?

>> I'd better suggest the following experiment: please take my
>> mtprog\examples\example1\main.cpp example and plug your "SMART
>> synchronized mempool" to see the difference...
>
> You want to compare one thread-private allocator to another. What's that
> going to prove?
>

It really does make sense to compare map to hash_map performance,
doesn't it?
There are a lot of different allocator implementations, and some of them
are inherently slower than others. It always makes sense to perform
_real_ measurements.

Sergey P. Derevyago

Aug 11, 2008, 5:09:02 AM
Chris M. Thomasson wrote:
> Lets say that I have a server connection object that has buffers and
> socket, and pointers to other active connections. When data comes in on
> the connections socket I want to pass a pointer of the object to a pool
> of consumer threads which act on the received data. If I use your
> system, each time data comes in and I need to hand a pointer to the
> object off to a consumer, the whole connection object will be serialized
> into another copy.
>
Not really.
It doesn't make sense to copy if you can make sure that the connection
object is used by only one thread at a time.

If you look at mtprog\mtdirdiff\maintask.hpp you'll see that DirCont
has its own mem_pool:

struct DirCont {
    mem_pool ownmp;
    sh_text name;
    hash_vec<sh_text, char> dirs;
    hash_vec<sh_text, unsigned long> files;
    // ...
};

The point is that it's used only by one thread at a time but the thread
isn't always the same.
So we see not a thread-private mem_pool but a DATA-private one.
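The DirCont pattern above, where the data carries its own pool and whole
ownership moves between threads, can be sketched roughly as follows. This is
a hypothetical illustration (the `arena` and `dir_cont` names are mine, not
derslib's), not the real mem_pool:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical data-private pool: the object owns its own arena, so
// whichever single thread currently owns the object can allocate from
// it without any synchronization at all.
class arena {
    std::vector<char*> chunks_;  // raw blocks obtained from ::operator new
    char* cur_ = nullptr;
    std::size_t left_ = 0;
    static const std::size_t CHUNK = 4096;
public:
    void* alloc(std::size_t n) {
        n = (n + 7) & ~std::size_t(7);           // round up to 8 bytes
        if (n > left_) {                         // slow path: grab a new chunk
            std::size_t sz = n > CHUNK ? n : CHUNK;
            cur_ = static_cast<char*>(::operator new(sz));
            chunks_.push_back(cur_);
            left_ = sz;
        }
        void* p = cur_;                          // fast path: bump a pointer
        cur_ += n;
        left_ -= n;
        return p;
    }
    ~arena() {                                   // everything dies at once
        for (std::size_t i = 0; i < chunks_.size(); ++i)
            ::operator delete(chunks_[i]);
    }
};

// The DirCont idea: the data and its pool travel together, so the pool
// is private to the data, not to any particular thread.
struct dir_cont {
    arena own;   // analogous to DirCont::ownmp
    // ... names, dirs, files allocated from `own` ...
};
```

Handing a dir_cont* to another thread hands over the whole arena with it; as
long as only one thread touches the object at a time, no locking is needed.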

Dmitriy V'jukov

Aug 11, 2008, 2:59:07 PM
On Aug 8, 3:29 pm, "Sergey P. Derevyago" <non-exist...@iobox.com>

wrote:
> > I don't want to waste my time... at least this way.
> >
> So please don't make unverified statements!
>
> > Please, can you
> > make some hint about where I can expect performance degradation? The
> > only thing I see is one additional 'if' on *slow-path* in alloc()...
>
> 1. I published my code and the performance/scalability results.
> 2. There is no your code, the claims only.
>
> So what do you expect?! Please don't make questionable statements!


There is really no need for 'my code'. If you provide a hook for
slow-path allocation in your allocator, then you can consider 'your
code' to be 'my code'.


Dmitriy V'jukov

Chris M. Thomasson

Aug 11, 2008, 11:33:37 PM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:489ffd76$0$90264$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> You don't seem to get it. The "SMART" memory pool allocator has the exact
>> same performance as yours.
> >
> Have you really tested?
>
>> http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html
>>
> I only skimmed over the code and see that your allocate_local() under
> some circumstances calls allocate_remote(sz) which imposes synchronization
> overhead.
> So the question remains: have you _really_ tested?

Yes. When you get some more free time, go ahead and look at the actual
_reason_ why the local allocation function defers to the remote one: it's
only because the local thread-private pool is totally exhausted. Therefore,
when a thread-private pool is exhausted, my allocator uses synchronization,
which has fairly significant overhead indeed. IMVHO, an atomic operation is
expensive; period. That's why I admire local allocation algorithms such as
yours. Mine is totally general purpose, and will use sync ops only on
extreme boundary conditions, such as the pool-exhausted condition.

One point: if an application constantly allocates in one thread and frees in
another, well, the sync-condition will be bounded by the depth of the
thread-private memory pool itself.
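That bounded sync-condition can be sketched as follows. This is a
hypothetical simplification (not the actual vzmalloc code, and written with
C++11 atomics for brevity): the owner allocates from a private freelist with
no atomics at all, foreign threads push frees onto a lock-free stack, and the
owner drains that stack only when its private list runs dry.

```cpp
#include <atomic>
#include <cassert>
#include <new>

// One fixed-size block; the freed block itself stores the link.
struct block { block* next; };

class private_pool {
    block* local_ = nullptr;               // owner-only, no synchronization
    std::atomic<block*> remote_{nullptr};  // foreign frees land here
public:
    void* alloc() {                        // called by the owner thread only
        if (!local_)                       // slow path: one atomic exchange
            local_ = remote_.exchange(nullptr);
        if (!local_)                       // pool truly exhausted: hit the heap
            return ::operator new(sizeof(block));
        block* b = local_;                 // fast path: plain pointer ops
        local_ = b->next;
        return b;
    }
    void free_local(void* p) {             // called by the owner thread
        block* b = static_cast<block*>(p);
        b->next = local_;
        local_ = b;
    }
    void free_remote(void* p) {            // called by any other thread
        block* b = static_cast<block*>(p);
        b->next = remote_.load(std::memory_order_relaxed);
        while (!remote_.compare_exchange_weak(b->next, b)) {}
    }
};
```

The owner pays for an atomic operation only when its private list empties,
which is exactly the bound described above.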

Chris M. Thomasson

Aug 11, 2008, 11:37:03 PM
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:2563265c-b93d-4f76...@r15g2000prd.googlegroups.com...

RIGHT! Great point. I totally forgot to mention:

http://groups.google.com/group/comp.arch/browse_frm/thread/6dc825ec9999d3a8

Again, Sergey, when you get free time, please consider the thread linked to
above. Your mem_pool work can be automatically turned into something that is
directly analogous to the `ismm06.pdf' paper.


Chris M. Thomasson

Aug 11, 2008, 11:39:11 PM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:489ff9dc$0$90269$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf
>>
> I need some time to review this paper. Wait please...
>
>> Please don't say that smart mem pools are impossible.
> >
> I'm just asking to plug _real_ "smart mem pool" to my example1 and show
> the results.
> Is it impossible? Why?

You can plug the following allocator into your algorithms:

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html

because it can overload the global new/delete operators. When I get some
free time, I will go ahead and do so. I think the numbers will be more
reliable if multiple parties join in.


>> You just end up making it seem like you don't have any idea on how
>> state-of-the-art memory allocators actually work!
> >
> Does ismm06.pdf contain the necessary description?
> Something else to review?

AFAICT, the paper explains things quite well indeed.

Chris M. Thomasson

Aug 11, 2008, 11:40:07 PM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:489ffa93$0$90276$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> http://groups.google.com/group/comp.programming/msg/bb3c14740615e266
>
> Could you please plug your algorithm to my example1 and show the results?

I will when I get some time. Hopefully, you can expect some results within a
week.

Chris M. Thomasson

Aug 12, 2008, 12:02:36 AM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:489ffa93$0$90276$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> http://groups.google.com/group/comp.programming/msg/bb3c14740615e266
>
> Could you please plug your algorithm to my example1 and show the results?

Please excuse my ignorance wrt your library, but how do I build example1 for
a windows platform with mingw?

Chris M. Thomasson

Aug 12, 2008, 12:23:28 AM
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:ET7ok.20267$QX3....@newsfe02.iad...

I run `mtprog/examples/example1/mogcc.bat' and get the following files:

`mtprog/examples/example1/out.gcc/main1'
`mtprog/examples/example1/out.gcc/main2'


I run:

./main1

and get no output. No feedback. Nothing.

I run:

./main2

and get a shi%load of errors.

What do I do with those on a windows mingw/cygwin platform? How do I
generate your lib*.a or lib*.so files for `derslib'?

I have cygwin special gcc 3.4.4

and mingw special 3.4.5

Should I just do this on my Solaris or Linux Box?

My main question:

I need the .so and/or .a files for your derslib library. I don't want to
waste any more time.

Chris M. Thomasson

Aug 12, 2008, 12:44:57 AM
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:ET7ok.20267$QX3....@newsfe02.iad...

Never mind, I got it to compile and link. Anyway, I already found a bug in
your main.cpp file of example1...

You include stdio.h in a cpp program. This is wrong; you need

<cstdio>

Also, AFAICT the command line goes like this, right:


number_of_threads std_or_ders

where std == 0, and ders == 1


Right?

Chris M. Thomasson

Aug 12, 2008, 12:48:00 AM
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:lv8ok.16681$yn5....@newsfe08.iad...

Okay, again, please excuse my total ignorance, but valid command lines
are as follows:


4 std
4 ders

7 std
7 ders

first param is thread count, second is library.


I got it now.

Chris M. Thomasson

Aug 12, 2008, 12:55:53 AM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:4899c3e1$0$90276$1472...@news.sunsite.dk...
>* I've written a C++ multithreading tutorial. Here it is if you can read
>Russian: http://ders.stml.net/cpp/mtprog/mtprog.html.
> * The source code is http://ders.stml.net/cpp/mtprog/code.zip and it has
> English comments.
> * Doxygen is http://ders.stml.net/cpp/mtprog/doc/index.html
>
> I have no time to create a full English translation so I'm going to
> describe only several important issues.

[...]

what's going on, Sergey? On my P4 3.06GHz single-cpu hyper-thread box I get:

an output of

2 8282 std

using a command line of:

2 std

------------------------------------------------------------------------

I get an output of:

2 9656 ders

using a command like of:

2 ders

Yours is definitely slower than the standard on my box.

Will somebody else post their results on similar iron?


I am going to post the results using my one of my region allocators here:

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html


within a day or two.

Chris M. Thomasson

Aug 12, 2008, 1:33:43 AM
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:BF8ok.16685$yn5....@newsfe08.iad...

> "Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
> news:4899c3e1$0$90276$1472...@news.sunsite.dk...
>>* I've written a C++ multithreading tutorial. Here it is if you can read
>>Russian: http://ders.stml.net/cpp/mtprog/mtprog.html.
>> * The source code is http://ders.stml.net/cpp/mtprog/code.zip and it has
>> English comments.
>> * Doxygen is http://ders.stml.net/cpp/mtprog/doc/index.html
>>
>> I have no time to create a full English translation so I'm going to
>> describe only several important issues.
>
> [...]
>
> what's going on Sergey? On my P4 3.06gz single-cpu hyper-thread box I get:
>
> an output of
>
> 2 8282 std
>
> using a command line of:
>
> 2 std
>
> ------------------------------------------------------------------------
>
> I get an output of:
>
> 2 9656 ders
>
> using a command like of:
>
> 2 ders
>
>
>
> Yours is definitely slower than the standard on my box.
[...]

Yours is on average around 3-5 seconds faster than standard when I crank up
the threads to over 8. I get:


10 44124 std
10 39213 ders

Chris M. Thomasson

Aug 12, 2008, 1:42:07 AM
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:BF8ok.16685$yn5....@newsfe08.iad...
> "Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
> news:4899c3e1$0$90276$1472...@news.sunsite.dk...
>>* I've written a C++ multithreading tutorial. Here it is if you can read
>>Russian: http://ders.stml.net/cpp/mtprog/mtprog.html.
>> * The source code is http://ders.stml.net/cpp/mtprog/code.zip and it has
>> English comments.
>> * Doxygen is http://ders.stml.net/cpp/mtprog/doc/index.html
>>
>> I have no time to create a full English translation so I'm going to
>> describe only several important issues.
>
> [...]
>
> what's going on Sergey? On my P4 3.06gz single-cpu hyper-thread box I get:
[...]

> I am going to post the results using my one of my region allocators here:
>
> http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html
>
>
> within a day or two.

Your example1 test application for the normal standard list is:

const int N=1000;
const int M=10000;

void start_std(void*)
{
    list<int> lst;
    for (int i=0; i<N; i++) {
        for (int j=0; j<M; j++) lst.push_back(j);
        for (int j=0; j<M; j++) lst.pop_front();
    }
}

where start_std is the thread entry function, right?

Chris M. Thomasson

Aug 12, 2008, 1:59:49 AM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:489ffa93$0$90276$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> http://groups.google.com/group/comp.programming/msg/bb3c14740615e266
>
> Could you please plug your algorithm to my example1 and show the results?

You're not going to like them... Here is the code, compilable with MSVC 7.1
or higher:

http://webpages.charter.net/appcore/vzoom/malloc/sergey_vzmem_thread.html

I plug my multi-threading allocator in your test code as:

const int N=1000;
const int M=10000;

class sergey_vzmem_thread : public thread_base {
    void on_entry() {
        std::list<int> lst;

        for (int i=0; i<N; i++) {
            for (int j=0; j<M; j++) lst.push_back(j);
            for (int j=0; j<M; j++) lst.pop_front();
        }
    }
};

I get on release build for 2 threads an output of:


(1) - test running
(1) - test finished


test time: 2875 ms
--------------------


With your code on mingw, for 2 threads I get an output of:

2 8282 std


With my allocator and 10 threads I get an output of:

(1) - test running
(1) - test finished


test time: 14563 ms
--------------------


your code with 10 threads is:

10 39213 ders


My smart multi-threading region allocator kills yours dead; Sorry.

Chris M. Thomasson

Aug 12, 2008, 2:00:58 AM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:3d9ok.17697$1N1...@newsfe07.iad...


OOPS! The last one was supposed to read as:

10 39213 ders


Chris M. Thomasson

Aug 12, 2008, 2:04:50 AM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:xB9ok.17710$1N1....@newsfe07.iad...

> "Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
> news:489ffa93$0$90276$1472...@news.sunsite.dk...
>> Chris M. Thomasson wrote:
>>> http://groups.google.com/group/comp.programming/msg/bb3c14740615e266
>>
>> Could you please plug your algorithm to my example1 and show the results?
>
> Your not going to like them... Here is the code compliable with MSVC 7.1
> or higher:
>
> http://webpages.charter.net/appcore/vzoom/malloc/sergey_vzmem_thread.html
>
> I plug my multi-threading allocator in your test code as:
>
> const int N=1000;
> const int M=10000;
>
> class sergey_vzmem_thread : public thread_base {
> void on_entry() {
> std::list<int> lst;
> for (int i=0; i<N; i++) {
> for (int j=0; j<M; j++) lst.push_back(j);
> for (int j=0; j<M; j++) lst.pop_front();
> }
> }
> };
>
>
>
> I get on release build for 2 threads an output of:
>
>
> (1) - test running
> (1) - test finished
>
>
> test time: 2875 ms
> --------------------
>
>
>
>
>
>
> With your code on mingw, for 2 threads I get an output of:
>
> 2 8282 std

Well, I made a cut-and-paste mistake. Okay, I need to create a fresh thread
and post results.


The mistake I made was on the 2-thread version of your run. I accidentally
ran the STL version (i.e., a cmd-line of 2 std instead of 2 ders). The
output was actually:

2 9656 ders


I will entitle the thread:


Sergey P. Derevyago -VS- vzmem...

Sergey P. Derevyago

Aug 12, 2008, 4:24:52 AM
Chris M. Thomasson wrote:
> You can plug the following allocator into your algorihtms:
> http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html
>
Unfortunately it requires i686-specific compilers.

Sergey P. Derevyago

Aug 12, 2008, 4:26:49 AM
Chris M. Thomasson wrote:
> Please excuse my ignorance wrt your library, but how do I build example1
> for a windows platform with mingw?
>
Please look at mdgcc.bat and mogcc.bat files.

Sergey P. Derevyago

Aug 12, 2008, 4:32:42 AM
Chris M. Thomasson wrote:
> My main question:
> I need the .so and/or .a files for your derslib library. I don't want to
> waste any more time.
>
There are no .so and/or .a files, sorry.
The point is that it's a TUTORIAL, so the reader is expected just to
run a simple .bat file rather than install and tune specific build systems.

Sergey P. Derevyago

Aug 12, 2008, 4:39:32 AM
Chris M. Thomasson wrote:
> Never mind, I got it to compile and link. Anyway, I already found a bug
> in your main.cpp file of example1...
>
Great! :)

> You include stdio.h in a cpp program. this is wrong; you need
> <cstdio>
>

Not really.
<stdio.h> is really legal: no one is going to ban POSIX.

> also, afaict the command line goes like this right;
>
> number_of_threads std_or_ders
>
> where std == 0, and ders == 1
>

@out.gcc\main 1 std
@out.gcc\main 2 std
@out.gcc\main 4 std
@out.gcc\main 8 std
@out.gcc\main 16 std
@out.gcc\main 32 std
@out.gcc\main 64 std
@out.gcc\main 1 ders
@out.gcc\main 2 ders
@out.gcc\main 4 ders
@out.gcc\main 8 ders
@out.gcc\main 16 ders
@out.gcc\main 32 ders
@out.gcc\main 64 ders

Sergey P. Derevyago

Aug 12, 2008, 4:43:40 AM
Chris M. Thomasson wrote:
> My smart multi-threading region allocator kills yours dead; Sorry.
>
Very good :)
But the test isn't quite fair, I'll send the details later...

Sergey P. Derevyago

Aug 12, 2008, 4:53:48 AM
Chris M. Thomasson wrote:
> Again, Sergey, when you get free time, please consider the thread linked
> to above. Your mem_pool work can be automatically turned into something
> that is directly analogous to the `ismm06.pdf' paper.
>
Yes, this is the root of confusion!

My mem_pool is surely built on standard new/delete:

if (size<=mp_impl::MOS1) {
void*& head=IMPL->heads1[(size-1)/mp_impl::SZVP];
if (!head) IMPL->format_new_chunk(head, size, mp_impl::SZVP);

ret=head;
head=*reinterpret_cast<void**>(head);
}
else if (size<=mp_impl::MOS2) {
void*& head=IMPL->heads2[(size-mp_impl::MOS1-1)/mp_impl::SZVP8];
if (!head) IMPL->format_new_chunk(head, size, mp_impl::SZVP8);

ret=head;
head=*reinterpret_cast<void**>(head);
}
else ret=operator new(size); // <-------------------------------------

But your code replaces new/delete rather than mem_pool. So you should
compare
mem_pool+your_pool with your_pool
rather than
mem_pool+new/delete with your_pool.

What I wanted to say is: you should use thread-private mem_pool over
some other general-purpose memory allocator because a bare
general-purpose memory allocator is always slower!

Sergey P. Derevyago

Aug 12, 2008, 4:56:06 AM
Chris M. Thomasson wrote:
> what's going on Sergey? On my P4 3.06gz single-cpu hyper-thread box I get:
>
> an output of
>
> 2 8282 std
>
> using a command line of:
>
> 2 std
>
> ------------------------------------------------------------------------
>
> I get an output of:
>
> 2 9656 ders
>
> using a command like of:
>
> 2 ders
>
>
>
> Yours is definitely slower than the standard on my box.
>

Have you disabled assertions with -DNDEBUG?

Chris M. Thomasson

Aug 12, 2008, 5:52:27 AM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:48a148d5$0$90263$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> You can plug the following allocator into your algorihtms:
>> http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html
>>
> Unfortunately it requires i686-specific compilers.

Yup. That's a major caveat indeed!

Chris M. Thomasson

Aug 12, 2008, 5:55:49 AM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:48a14d3d$0$90270$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> My smart multi-threading region allocator kills yours dead; Sorry.
> >
> Very good :)
> But the test isn't quite fair, I'll send the details later...

Well, my allocator seems to beat yours on my specific platform of P4 3.06GHZ
512mb. That's not conclusive. What I need to do is get my allocator running
under a portable atomic operations API. I think I will go with the HP
atomic_ops package:

http://www.hpl.hp.com/research/linux/atomic_ops/

It's portable to various "popular" platforms, and will allow others who do
not have x86 to run my allocator.

Chris M. Thomasson

Aug 12, 2008, 5:58:32 AM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:48a15026$0$90271$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> what's going on Sergey? On my P4 3.06gz single-cpu hyper-thread box I
>> get:
>>
>> an output of
>>
>> 2 8282 std
>>
>> using a command line of:
>>
>> 2 std
>>
>> ------------------------------------------------------------------------
>>
>> I get an output of:
>>
>> 2 9656 ders
>>
>> using a command like of:
>>
>> 2 ders
>>
>>
>>
>> Yours is definitely slower than the standard on my box.
>>
>
> Have you disabled assertions with -DNDEBUG?

http://groups.google.com/group/comp.programming.threads/msg/ad1b3f703c748892

Chris M. Thomasson

Aug 12, 2008, 6:09:13 AM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:489fff39$0$90270$1472...@news.sunsite.dk...
> Chris M. Thomasson wrote:
>> You seem to be forgetting the conversation we had a couple of years ago.
>> You agreed that my allocator design makes sense:
>>
>> http://groups.google.com/group/comp.programming.threads/msg/f5ea5a977a156a03
>> You already know that its not impossible to create a thread-private
>> allocator in which memory can be freed into any thread.
>>
> I need a timeout :)
> I'm going to review ismm06.pdf and return to this question, OK?
>
>>> I'd better suggest the following experiment: please take my
>>> mtprog\examples\example1\main.cpp example and plug your "SMART
>>> synchronized mempool" to see the difference...
>>
>> You want to compare one thread-private allocator to another. What's that
>> going to prove?
> >
> It does really make sense to compare the map to hash_map performance,
> doesn't it?
> There are a lot of different allocator implementations, some of them are
> inherently slower then others. It always makes sense to perform _real_
> measurements.

I am interested in measuring general-purpose against your special-purpose
algorithm.

Chris M. Thomasson

Aug 12, 2008, 6:18:12 AM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:48a14f9d$0$90271$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> Again, Sergey, when you get free time, please consider the thread linked
>> to above. Your mem_pool work can be automatically turned into something
>> that is directly analogous to the `ismm06.pdf' paper.
>>
> Yes, this is the root of confusion!
>
> My mem_pool is surely built on standard new/delete:
>
> if (size<=mp_impl::MOS1) {
> void*& head=IMPL->heads1[(size-1)/mp_impl::SZVP];
> if (!head) IMPL->format_new_chunk(head, size, mp_impl::SZVP);
>
> ret=head;
> head=*reinterpret_cast<void**>(head);
> }
> else if (size<=mp_impl::MOS2) {
> void*& head=IMPL->heads2[(size-mp_impl::MOS1-1)/mp_impl::SZVP8];
> if (!head) IMPL->format_new_chunk(head, size, mp_impl::SZVP8);
>
> ret=head;
> head=*reinterpret_cast<void**>(head);
> }
> else ret=operator new(size); // <-------------------------------------
>
> But your code replaces new/delete rather then mem_pool. So you should
> compare
> mem_pool+your_pool with your_pool
> rather than
> mem_pool+new/delete with your_pool.

You mean allow your mem_pool to use my region allocator as a backing store?


> What I wanted to say is: you should use thread-private mem_pool over some
> other general-purpose memory allocator because a bare general-purpose
> memory allocator is always slower!

Anyway, on Dmitriy's box, my general-purpose algorithm is around 100ms
slower for 4 threads, and about 1 second slower for 16 threads. Pretty good
for bare general-purpose vs. specialized algorithm:

http://groups.google.com/group/comp.programming.threads/msg/cdb7a19bea94c3e9

Dmitriy V'jukov

Aug 12, 2008, 6:54:42 AM
On Aug 6, 21:08, "Sergey P. Derevyago" <non-exist...@iobox.com> wrote:
> Dmitriy V'jukov wrote:
> > You conclude that Hyper-Threaded processor is worse than non Hyper-
> > Threaded processor.
>
> Not exact.
> From the _scalability_ POV one HT CPU is worse then one non-HT CPU for
> _this kind of tasks._
>
> > Have you applied optimization techniques described
> > in "Intel 64 and IA-32 Architectures Optimization Reference Manual"?
>
> There was no point to explicitly code for HT processors.
> HT quirks are out of scope.
> Nevertheless, please feel free to apply any optimizations you want and
> report the results...

You benchmark on an SMP system, and you especially optimize for the SMP
system (this is the main purpose of the allocator). But you also
benchmark on an HT system, which you didn't optimize for. Well, to me this
looks quite dishonest. To get honest results and make honest
conclusions you must either optimize for HT too or exclude it from the
test. Imho.

Dmitriy V'jukov

Chris M. Thomasson

unread,
Aug 12, 2008, 7:11:09 AM8/12/08
to
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:4899c3e1$0$90276$1472...@news.sunsite.dk...
>* I've written a C++ multithreading tutorial. Here it is if you can read
>Russian: http://ders.stml.net/cpp/mtprog/mtprog.html.
> * The source code is http://ders.stml.net/cpp/mtprog/code.zip and it has
> English comments.
> * Doxygen is http://ders.stml.net/cpp/mtprog/doc/index.html
>
> I have no time to create a full English translation so I'm going to
> describe only several important issues.
[...]

Do you have an alignment issue with your code? Apparently, you do:

http://groups.google.com/group/comp.programming.threads/msg/2143c3a4344f25a7

I have not confirmed this yet. Actually, I only very briefly scanned your
code, and did not bother to look at alignment because I figured that was a
given: any allocator code should explicitly handle it, because it is such an
important issue indeed. If your code does align memory correctly, please
point that out. Otherwise, you really should fix this error. It will
slow the performance down just a little bit, but unfortunately that's the
price to pay for correctness...

Sergey P. Derevyago

unread,
Aug 12, 2008, 7:11:00 AM8/12/08
to
Dmitriy V'jukov wrote:
> You benchmark on SMP system, and you especially optimize for SMP
> system (this is the main purpose of the allocator). But you also
> benchmark on HT system, but didn't optimize for it.
>
It's not the case, sorry. I've written a C++ multithreading TUTORIAL.

A tutorial with a "one size fits all" example that allows for
simultaneous execution.

That's it! There was no point to explicitly optimize for HT...

Chris M. Thomasson

Aug 12, 2008, 7:24:43 AM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:48a16fc4$0$90272$1472...@news.sunsite.dk...

> Dmitriy V'jukov wrote:
>> You benchmark on SMP system, and you especially optimize for SMP
>> system (this is the main purpose of the allocator). But you also
>> benchmark on HT system, but didn't optimize for it.
> >
> It's not the case, sorry. I've written a C++ multithreading TUTORIAL.
>
> A tutorial with a "one size fits all" example that allows for simultaneous
> execution.
>
> That's it! There was no point to explicitly optimize for HT...

Here is something that could help improve HT performance in your benchmarks:

http://softwarecommunity.intel.com/Wiki/Multi-threadappsforMulti-core/487.htm

Intel had a nasty bug in early HT processors. You can work around it by
assigning each thread a monotonically increasing id, multiplying that by 64,
and using the result in a call to alloca. Here is some more on that:

http://groups.google.com/group/comp.programming.threads/msg/b1a5cdeb5ea55a05
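The stack-offset workaround described above can be sketched like this. It is
a hypothetical helper (the names are mine), written with C++11 atomics for
brevity and assuming a POSIX <alloca.h>:

```cpp
#include <alloca.h>   // assumed POSIX; use <malloc.h> on Windows
#include <atomic>
#include <cassert>

// Sketch of the P4 Hyper-Threading aliasing workaround: give each
// thread a monotonically increasing id and shift its stack by a
// distinct multiple of 64 bytes with alloca before running the real
// work, so sibling HT threads' stacks don't collide on aliased
// cache-line addresses.
static std::atomic<int> g_thread_seq(0);

inline void thread_entry(void (*real_entry)(void*), void* arg)
{
    int id = g_thread_seq.fetch_add(1);
    // Offset this thread's stack; the modulus keeps the offset bounded.
    volatile char* pad =
        static_cast<volatile char*>(alloca((id % 16 + 1) * 64));
    pad[0] = 0;          // touch the block so the alloca isn't optimized away
    real_entry(arg);     // run the thread's actual entry function
}
```

Each real thread would call thread_entry() as its first action; the rest of
the program is unchanged.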

Sergey P. Derevyago

Aug 20, 2008, 6:46:26 AM
Sergey P. Derevyago wrote:
> Chris M. Thomasson wrote:
>> http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf
>>
> I need some time to review this paper. Wait please...
>
Well, the paper has been reviewed. The main impression is that it's of
average quality, nothing interesting.
The presented Streamflow allocator has no chance to serve as a _general
purpose_ C++ allocator: it
"is implemented on top of Linux 2.6, which provides support for
superpages via a virtual filesystem. Streamflow allocates superpages by
creating files in the virtual filesystem, and mapping these files in
whole or in part to virtual memory."

On the contrary, my mem_pool doesn't rely on any OS-specific features
and can be used in an ordinary single-threaded environment without any
penalty. There is no thread-specific overhead in the code!

So the main points are:
1. Design your application to allow for simultaneous execution. I.e.
don't allocate/free the memory across the thread bounds (if you can't
prove that it really makes sense _for this particular task_).
2. Then use mem_pool as a thread-private (sometimes, data-private)
allocator to greatly improve the performance.

In essence, the non-synchronized mem_pool allocator is built on top of
the ordinary synchronized new/delete operators, so the combination
mem_pool+new is what is really used. On the contrary, Streamflow tries to
replace not the mem_pool allocator itself but the whole mem_pool+new
combination!
So if you're going to compare mem_pool to some other allocator, please
make sure that it uses the same synchronized "new part" of the
mem_pool+new combination. I.e. it's not quite fair to compare
mem_pool+"old bad new" with a heavily optimized SuperPuperAllocator: you
should cut out the appropriate synchronized part of SuperPuperAllocator and
plug it into mem_pool instead of "old bad new" to see the difference
mem_pool can give.
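The mem_pool+new split can be sketched like this. It is a hypothetical
simplification of the real mem_pool (a single size-class array and a fixed
refill count), with the backing allocator made an explicit, pluggable
parameter, which is exactly the "cut the synchronized part out and plug it
in" experiment proposed above:

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// The pluggable, possibly synchronized, "new part" of the combination.
typedef void* (*backend_alloc)(std::size_t);

// Example backend: plain ::operator new, with a call counter so we can
// observe how rarely the caching layer actually reaches it.
static int g_backend_calls = 0;
static void* new_backend(std::size_t n) { ++g_backend_calls; return ::operator new(n); }

const std::size_t MAX_SMALL = 256;  // largest "small" request
const std::size_t GRAN = 8;         // size-class granularity
const int REFILL = 32;              // blocks per backend refill

// Unsynchronized caching layer: one free list per size class,
// refilled from the backend a whole chunk at a time.
class pool {
    void* heads_[MAX_SMALL / GRAN] = {};
    backend_alloc backend_;
public:
    explicit pool(backend_alloc b) : backend_(b) {}
    void* alloc(std::size_t n) {
        if (n == 0 || n > MAX_SMALL) return backend_(n);  // large: straight through
        void*& head = heads_[(n - 1) / GRAN];
        if (!head) {                                      // refill: one backend call
            std::size_t sz = ((n - 1) / GRAN + 1) * GRAN;
            char* chunk = static_cast<char*>(backend_(sz * REFILL));
            for (int i = 0; i < REFILL; ++i) {            // thread blocks onto the list
                *reinterpret_cast<void**>(chunk + i * sz) = head;
                head = chunk + i * sz;
            }
        }
        void* ret = head;                                 // fast path: no sync at all
        head = *reinterpret_cast<void**>(head);
        return ret;
    }
    void free_small(void* p, std::size_t n) {             // small blocks only
        void*& head = heads_[(n - 1) / GRAN];
        *reinterpret_cast<void**>(p) = head;
        head = p;
    }
};
```

Comparing pool(fast_backend) against pool(new_backend) measures the backend
alone; comparing pool(X) against bare X measures what the unsynchronized
caching layer adds, which is the fair comparison argued for above.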

Chris M. Thomasson

Aug 21, 2008, 7:10:59 AM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:48abf603$0$90275$1472...@news.sunsite.dk...

> Sergey P. Derevyago wrote:
>> Chris M. Thomasson wrote:
>>> http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf
>>>
>> I need some time to review this paper. Wait please...
>>
> Well, the paper has been reviewed. The main impression is that it's of
> average quality, nothing interesting.
> The presented Streamflow allocator has no chance to serve as a _general
> purpose_ C++ allocator:

ROFL!

Ummm, you can use StreamFlow to replace new/delete, just like the vzmem
region allocator. You need to actually understand the paper. Your statement
seems to suggest that you don't know what the heck you're talking about! You
can't seem to wrap your mind around the fact that general-purpose
allocators can be built with per-thread memory pools...

I have had extensive conversations with the implementors of StreamFlow, and
can tell you to either talk to them, learn about it, or stop spewing
nonsense! Learn something for once! Your allocator is fine. However, there
are general-purpose allocators that are just as fast and OH SO MUCH MORE
FLEXIBLE!

If you actually read the paper, you would realize that StreamFlow works on
platforms with a certain version of CAS... What type of CAS does it need?
Don't you dare think it needs pointers in that CAS. Give me an answer and
prove you actually understand what you read!!!!!!!

;^/

[...]

Chris M. Thomasson

Aug 21, 2008, 7:14:47 AM
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:48abf603$0$90275$1472...@news.sunsite.dk...
> Sergey P. Derevyago wrote:
>> Chris M. Thomasson wrote:
>>> http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf
>>>
>> I need some time to review this paper. Wait please...
>>
> Well, the paper has been reviewed. The main impression is that it's of
> average quality, nothing interesting.
> The presented Streamflow allocator has no chance to serve as a _general
> purpose_ C++ allocator: it
> "is implemented on top of Linux 2.6, which provides support for superpages
> via a virtual filesystem. Streamflow allocates superpages by creating
> files in the virtual filesystem, and mapping these files in whole or in
> part to virtual memory."

[...]

I have implemented the StreamFlow algorithm for fun under Windows and
Solaris (x86-32 and SPARC 32/64 versions). You don't know what you're
writing about. If I can read your mind, you seem to be under the
impression that you can't represent a superblock in their lock-free
algorithm? On the contrary, it becomes far easier because you can use the
alignment trick.
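
The alignment trick mentioned above can be sketched as follows. This is a minimal illustration, not StreamFlow's actual code: the superblock size is an assumed value, and `superblock_of` is a hypothetical helper name. The idea is that if every superblock starts on a boundary that is a power of two, the owning superblock of any block can be recovered from the block's address with a single mask, so no per-block back-pointer is needed:

```cpp
#include <cassert>
#include <cstdint>

// Assumed superblock size; real allocators pick their own power of two.
constexpr std::uintptr_t SUPERBLOCK_SIZE = std::uintptr_t(1) << 16; // 64 KiB

// Recover the superblock base address from any interior block pointer
// by masking off the low bits (valid because superblocks are allocated
// on SUPERBLOCK_SIZE-aligned boundaries).
inline void* superblock_of(void* block) {
    return reinterpret_cast<void*>(
        reinterpret_cast<std::uintptr_t>(block) & ~(SUPERBLOCK_SIZE - 1));
}
```

Because the superblock header lives at that base address, a lock-free free path can locate it without dereferencing any stored pointer.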

Sorry for being so harsh.

Chris M. Thomasson

unread,
Aug 21, 2008, 7:29:01 AM8/21/08
to
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:m2crk.2596$4s1....@newsfe06.iad...

Well, the fact that several general-purpose allocators (Hoard, StreamFlow,
vzoom slab, vzoom region, etc.) exist which are based on per-thread memory
allocation schemes seems to blow your entire theory out of the water; right?
I am really sick and tired of hearing that general-purpose allocation is
non-scalable crap because it's simply NOT TRUE! You are totally misleading
everybody who reads your writings; that is NOT cool in any way, shape or
form. You fail to point to any state-of-the-art memory allocation algorithms
even though you knew about them as far back as 2006! That is not very
professional. Read here; indeed we have discussed this before, a couple of
years ago:

http://groups.google.com/group/comp.programming.threads/msg/bc019dc04362be41

http://groups.google.com/group/comp.programming.threads/msg/1d3aeaa6ebf5181f

http://groups.google.com/group/comp.programming.threads/msg/907573c0e81dba56

http://groups.google.com/group/comp.programming.threads/msg/f5ea5a977a156a03
(I finally got you to agree that my design makes sense... ;^)

http://groups.google.com/group/comp.arch/browse_frm/thread/24c40d42a04ee855


How does your theory that general-purpose memory allocation is crap and
not really useful in multi-threaded applications square with the contents of
the links above? Also, ever heard of Hoard?

:^|

Sergey P. Derevyago

unread,
Aug 21, 2008, 7:44:14 AM8/21/08
to

No communication after this point!

--

Sergey P. Derevyago

unread,
Aug 21, 2008, 7:53:07 AM8/21/08
to
Sergey P. Derevyago wrote:
> Well, the paper has been reviewed. The main impression is that it's
> of average quality, nothing interesting.
>
One can download the paper and look at the graphs and tables at pp.
9-10: Streamflow doesn't show exceptional quality.

> The presented Streamflow allocator has no chance to serve as a
> _general purpose_ C++ allocator: it
>

The point is that a well-designed MT application using the combination of
"mem_pool+new" will always beat "just new" for any possible "new".
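
[Editorial sketch of the idea behind that claim. This is not Sergey's actual mem_pool code (see code.zip for that); it is a generic thread-private bump allocator, with hypothetical names, showing why a pool touched by only one thread can allocate with a plain pointer increment and no locking, while a global operator new must synchronize:]

```cpp
#include <cstddef>
#include <cstdlib>

// A thread-private arena: because exactly one thread ever touches it,
// alloc() is a lock-free pointer bump; the whole arena is freed at once.
class arena {
    char* base_;  // start of the backing buffer
    char* cur_;   // next free byte
    char* end_;   // one past the last byte
public:
    explicit arena(std::size_t bytes)
        : base_(static_cast<char*>(std::malloc(bytes))),
          cur_(base_), end_(base_ + bytes) {}
    ~arena() { std::free(base_); }
    arena(const arena&) = delete;
    arena& operator=(const arena&) = delete;

    // Round the request up to max_align_t and bump the pointer.
    void* alloc(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        n = (n + a - 1) & ~(a - 1);
        if (cur_ + n > end_) return nullptr;  // sketch: no growth policy
        char* p = cur_;
        cur_ += n;
        return p;
    }
    std::size_t used() const { return static_cast<std::size_t>(cur_ - base_); }
};
```

In the "mem_pool+new" pattern, per-thread scratch data goes through such a pool while truly shared, long-lived objects still go through global new.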

Chris M. Thomasson

unread,
Aug 21, 2008, 7:59:42 AM8/21/08
to
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:48ad550f$0$90264$1472...@news.sunsite.dk...

You make a totally ridiculous claim about StreamFlow. Period.

Chris M. Thomasson

unread,
Aug 21, 2008, 8:14:04 AM8/21/08
to
"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:48ad5724$0$90272$1472...@news.sunsite.dk...

> Sergey P. Derevyago wrote:
>> Well, the paper has been reviewed. The main impression is that it's
>> of average quality, nothing interesting.
> >
> One can donwload the paper and look at the graphs and tables at pp. 9-10:
> Streamflow doesn't show exceptional quality.

You don't like its performance; fine. Well, what about Hoard? It uses
per-thread heaps as well. The fact is that your statement is totally bogus
wrt comparing it to your strictly single-threaded scheme, because its
benchmarks compare it to other multi-threaded allocators. You cannot even
begin to think about comparing your allocator to the StreamFlow benchmarks:
it's flexible enough to be used as a replacement for malloc/free/new/delete,
and yours is simply not. Yet if you used StreamFlow, Hoard, vzoom slab or
vzoom region in a strictly single-threaded application, well, the performance
between the general-purpose allocators and yours is going to be equal, or
perhaps yours might be 50-100 milliseconds faster. Big deal. We live in a MT
world today; single-thread-only allocators are fine, but they are not
going to take off. How can I plug your allocator into an existing
application? Sorry, I cannot, because it assumes the code using the
allocator knows about the existence of threading.


>> The presented Streamflow allocator has no chance to serve as a
>> _general purpose_ C++ allocator: it
> >
> The point is that well-designed MT application with the combination of
> "mem_pool+new" will always beat "just new" for any possible "new".

You totally miss the point! Let me correct your misleading statement:

Sergey simply seems to refuse to admit that a "well-designed" MT application
used in clever conjunction with a general-purpose multi-threaded allocator
which is based on per-thread memory pools will usually beat one that is
using a poorly designed multi-threaded allocator.


Also, you consistently fail to mention that your scheme has several
caveats when you compare it to a multi-threaded allocator... Hmm, if a
well-designed MT application uses a producer/consumer pattern, well, it's
going to have to serialize every single production/consumption pair. This
can be a serious performance penalty indeed.
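
[Editorial sketch of the caveat being described. Hypothetical names, not any shipping allocator: with a strictly thread-private pool, a block produced by one thread and consumed by another cannot be returned to the owner's free list directly; every cross-thread free must pass through some synchronized channel, here a mutex-protected remote-free list, so each producer/consumer hand-off pays one serialized operation:]

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

struct private_pool {
    std::vector<void*> local_free;   // touched only by the owner thread
    std::mutex remote_mtx;           // guards frees arriving from other threads
    std::vector<void*> remote_free;

    // Owner-thread free: no synchronization needed.
    void free_local(void* p) { local_free.push_back(p); }

    // Free called by a NON-owner thread: must serialize on the mutex.
    void free_remote(void* p) {
        std::lock_guard<std::mutex> lk(remote_mtx);
        remote_free.push_back(p);
    }

    // Owner periodically drains remotely freed blocks back into its
    // local list; returns how many blocks were reclaimed.
    std::size_t reclaim() {
        std::lock_guard<std::mutex> lk(remote_mtx);
        const std::size_t n = remote_free.size();
        local_free.insert(local_free.end(),
                          remote_free.begin(), remote_free.end());
        remote_free.clear();
        return n;
    }
};
```

General-purpose per-thread allocators address exactly this path (e.g. with lock-free remote-free lists), which is the design point being argued here.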

I advise you to quit misleading people who don't know any better. You're not
being intellectually honest, to say the least. I think you know better!

:^|

Chris M. Thomasson

unread,
Aug 21, 2008, 8:38:39 AM8/21/08
to
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:tIcrk.2611$4s1....@newsfe06.iad...

Your allocator works fine, and is a wonderful item for any programmer to
have in their tool box indeed. The only thing that irks me is when you
continue to claim that general-purpose allocation is dog slow.

David Thompson

unread,
Aug 24, 2008, 11:01:28 PM8/24/08
to
On Tue, 12 Aug 2008 11:39:32 +0300, "Sergey P. Derevyago"
<non-ex...@iobox.com> wrote:

> Chris M. Thomasson wrote:

> > You include stdio.h in a cpp program. This is wrong; you need
> > <cstdio>
> >
> Not really.
> <stdio.h> is really legal: no one is going to ban POSIX.
>

POSIX isn't the issue; stdio is standard-C and standard-C++.
Things like <fcntl.h> are (only) POSIX.

For new-in-C++ features standard-C++ has only the namespace-std::
versions e.g. <vector> not the pre-standard <vector.h>. But for the
parts adopted from C it has both the namespace-std:: version e.g.
<cstdio> and the 'global' version <stdio.h>. The latter, because they
don't provide the benefits of namespace segregation, are in general
considered poorer style, but they are absolutely standard.
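
[Editorial illustration of the point above; `format_answer` is a made-up helper. `<cstdio>` declares the C stdio names in namespace std, while `<stdio.h>` also puts them in the global namespace; both headers are standard C++:]

```cpp
#include <cstdio>   // std::printf, std::snprintf (namespace std)
#include <stdio.h>  // ::printf also usable (standard, but poorer style)
#include <cstddef>  // std::size_t

// Format a value into a buffer using the namespaced spelling;
// returns the number of characters written (excluding the NUL).
int format_answer(char* buf, std::size_t n) {
    return std::snprintf(buf, n, "answer=%d", 42);
}
```

With `<stdio.h>` alone, the same call could be spelled `snprintf(...)` in the global namespace; the behavior is identical, only the namespace segregation differs.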

Of course this isn't really relevant to threading.

- formerly david.thompson1 || achar(64) || worldnet.att.net
