Sergey P. Derevyago -VS- vzmem...

Chris M. Thomasson

Aug 12, 2008, 2:20:22 AM

Context:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/7a67bc70d425ca23
(READ ALL!)

Sergey's single-threaded mem_pool code is here:

http://ders.stml.net/cpp/mtprog/mtprog.html
(website)

http://ders.stml.net/cpp/mtprog/code.zip
(code)

http://ders.stml.net/cpp/mtprog/doc/index.html
(doxygen)

His benchmark code is as follows:

void start_std(void*)
{
    list<int> lst;
    for (int i=0; i<N; i++) {
        for (int j=0; j<M; j++) lst.push_back(j);
        for (int j=0; j<M; j++) lst.pop_front();
    }
}

Simple enough. He challenged me to post results from my multi-threaded
general-purpose allocator using the same benchmark code. Here is my
response:

http://webpages.charter.net/appcore/vzoom/malloc/sergey_vzmem_thread.html

The platform I used for the tests is a P4 3.06mhz HyperThread 512mb...

For a simple test with 2 threads (e.g., cmd-line<2 ders>), here is the
output I get with MINGW GCC 3.4.5 with an optimization level of -O3:

2 9656 ders

The second column is the number of milliseconds the test ran.
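For anyone who wants to reproduce numbers like these, here is a minimal sketch of a timing harness around the same push/pop loop. It is not Sergey's actual driver: the names `churn_list` and `run_benchmark` are made up for illustration, and it assumes a C++11 compiler with `std::thread` rather than the threading layer used in the original code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <list>
#include <thread>
#include <vector>

// Same push/pop pattern as the benchmark above. Returns the final list size
// (always 0) so the work has an observable result.
std::size_t churn_list(int n, int m) {
    std::list<int> lst;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < m; ++j) lst.push_back(j);
        for (int j = 0; j < m; ++j) lst.pop_front();
    }
    return lst.size();
}

// Runs churn_list on `threads` threads at once and returns the wall-clock
// time in milliseconds from before the first thread starts until all joined.
long long run_benchmark(int threads, int n, int m) {
    assert(threads > 0);
    auto begin = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([n, m] { churn_list(n, m); });
    for (auto& w : workers) w.join();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
}
```

Calling `run_benchmark(2, N, M)` and printing the thread count next to the returned value gives a line in the same "threads, milliseconds" shape as the output above. The allocator under test would be plugged in through the list's allocator parameter or a global operator new/delete overload.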

Here is the output I get from running my code with 2 threads (e.g.,
THREAD_COUNT macro defined to a value of 2) compiled with MSVC 8.0 release
build:

(1) - test running
(1) - test finished

test time: 2875 ms
--------------------

For a test with 10 threads (e.g., cmd-line<10 ders>), Sergey's code gives
me:

10 39213 ders

Mine is (e.g., THREAD_COUNT macro defined to a value of 10):

(1) - test running
(1) - test finished

test time: 14563 ms
--------------------

Just to recap:

2 threads: Sergey(9656) -vs- Chris(2875) milliseconds

10 threads: Sergey(39213) -vs- Chris(14563) milliseconds

This does not prove anything! I urge others to run the benchmarks and post
the results here, please!

Thank you all.

Chris M. Thomasson

Aug 12, 2008, 2:25:09 AM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:OU9ok.17713$1N1....@newsfe07.iad...
[...]

> The platform i used for the tests is a P4 3.06mhz HyperThread 512mb...

P4 3.06-GHZ Hyper-Thread 512-MB

!

[...]

Sergey P. Derevyago

Aug 12, 2008, 5:08:25 AM

Chris M. Thomasson wrote:
> For a simple test with 2 threads (e.g., cmd-line<2 ders>), here is the
> output I get with MINGW GCC 3.4.5 with an optimization level of -O3:
>

Have you disabled assertions with -DNDEBUG?

--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net

Dmitriy V'jukov

Aug 12, 2008, 5:46:05 AM

On Aug 12, 10:20, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> 2 threads: Sergey(9656) -vs- Chris(2875) milliseconds
>
> 10 threads: Sergey(39213) -vs- Chris(14563) milliseconds
>
> This does not prove anything! I urge others to run the benchmarks an post
> the results here please!

I compiled both with MSVC8 /Ox. The hardware is an Intel Core 2 Quad 2.4GHz.
I set N=10000, M=10000.

4 threads:
std 254600
ders 2578
vzoom 2671 (std with defined THREAD_MALLOC_OVERLOAD_NEW_DELETE)

16 threads:
std - don't even want to run
ders 9500
vzoom 10515

Dmitriy V'jukov

Chris M. Thomasson

Aug 12, 2008, 5:52:00 AM

"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
news:48a15309$0$90269$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> For a simple test with 2 threads (e.g., cmd-line<2 ders>), here is the
>> output I get with MINGW GCC 3.4.5 with an optimization level of -O3:
>>
>
> Have you disabled assertions with -DNDEBUG?

Yeah. My code needs to have NDEBUG defined, or it performs an atomic
operation on every allocation and deallocation which would tank its
performance. Anyway, I need to see some numbers generated by others before I
can come to any concrete conclusions.
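The post does not say what the debug-only atomic operation actually is. Purely as an illustration of the pattern, here is a sketch of a hypothetical live-block counter that costs one atomic read-modify-write per allocation and per deallocation in debug builds and compiles away entirely when NDEBUG is defined. The names `counted_alloc`, `counted_free` and `live_blocks` are invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>

#ifndef NDEBUG
// Debug builds only: shared counter touched on every alloc and free.
static std::atomic<long> g_live_blocks{0};
#endif

void* counted_alloc(std::size_t size) {
    void* p = std::malloc(size != 0 ? size : 1);
#ifndef NDEBUG
    if (p) g_live_blocks.fetch_add(1, std::memory_order_relaxed);
#endif
    return p;
}

void counted_free(void* p) {
#ifndef NDEBUG
    if (p) {
        // Catches a free without a matching allocation.
        assert(g_live_blocks.load(std::memory_order_relaxed) > 0);
        g_live_blocks.fetch_sub(1, std::memory_order_relaxed);
    }
#endif
    std::free(p);
}

// Number of blocks currently outstanding; always 0 when NDEBUG is defined.
long live_blocks() {
#ifndef NDEBUG
    return g_live_blocks.load(std::memory_order_relaxed);
#else
    return 0;
#endif
}
```

Even with relaxed ordering, every thread hammers the same cache line in a debug build, which is why leaving NDEBUG undefined can distort a multi-threaded allocator benchmark.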

Chris M. Thomasson

Aug 12, 2008, 6:08:00 AM

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:1cb073c7-ab81-4552...@f36g2000hsa.googlegroups.com...

Great! Now that's more like it. I was wondering why my region allocator was
beating his memory allocation algorithm. Perhaps it is because region
allocators do not like certain usage patterns at all. I took a look at his
algorithm, and I concluded that region allocation should perform slower.
Also, I was compiling his code with MINGW/CYGWIN 3.4.5/3.4.4; that could
make a big difference.

Anyway, could you run the test with 2 threads and 10 threads? Also, try a
single-threaded run. And make sure to define NDEBUG.

I would love to run Sergey's code through the benchmark code provided in the
following link:

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html

but I cannot because it requires the ability of allocating in one thread and
freeing in another...

;^(...

Chris M. Thomasson

Aug 12, 2008, 6:22:24 AM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:bedok.17757$1N1....@newsfe07.iad...

> "Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
> news:1cb073c7-ab81-4552...@f36g2000hsa.googlegroups.com...
> On Aug 12, 10:20, "Chris M. Thomasson" <n...@spam.invalid> wrote:
>
>> > 2 threads: Sergey(9656) -vs- Chris(2875) milliseconds
>> >
>> > 10 threads: Sergey(39213) -vs- Chris(14563) milliseconds
>> >
>> > This does not prove anything! I urge others to run the benchmarks an
>> > post
>> > the results here please!
>
>
>> I compile both with MSVC8 /Ox. Hardware is Intel Core 2 Quad 2.4GHz.
>> And set N=10000, M=10000.
>
>> 4 threads:
>> std 254600
>> ders 2578
>> vzoom 2671 (std with defined THREAD_MALLOC_OVERLOAD_NEW_DELETE)
>
>> 16 threads:
>> std - don't even want to run
>> ders 9500
>> vzoom 10515
>
> Great! Now that's more like it.

> I was wondering why my region allocator was beating his memory allocation
> algorithm.

Need to postfix the following to the sentence above:

`on the platform I was running the tests on.'

[...]

Chris M. Thomasson

Aug 12, 2008, 6:28:23 AM

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:1cb073c7-ab81-4552...@f36g2000hsa.googlegroups.com...

On Aug 12, 10:20, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > 2 threads: Sergey(9656) -vs- Chris(2875) milliseconds
> >
> > 10 threads: Sergey(39213) -vs- Chris(14563) milliseconds
> >
> > This does not prove anything! I urge others to run the benchmarks an
> > post
> > the results here please!

> I compile both with MSVC8 /Ox. Hardware is Intel Core 2 Quad 2.4GHz.
> And set N=10000, M=10000.

[...]

One other request, Dmitriy: in the following section of my code:

/* Global Settings
_____________________________________________________________*/
#define THREAD_MALLOC_OVERLOAD_NEW_DELETE

#if ! defined(THREAD_ALLOCATOR_HEAP_SIZE)
# define THREAD_ALLOCATOR_HEAP_SIZE 262144
#endif
#if ! defined(THREAD_ALLOCATOR_DTOR_MARK)
# define THREAD_ALLOCATOR_DTOR_MARK 0x80000000
#endif
#if ! defined(THREAD_ALLOCATOR_PRIME_COUNT)
# define THREAD_ALLOCATOR_PRIME_COUNT 1
#endif
#if ! defined(THREAD_ALLOCATOR_MAX_COUNT)
# define THREAD_ALLOCATOR_MAX_COUNT 2
#endif

Try increasing the prime count by 2 or 3, and increasing the max count by 3
or 4; just make sure the max count is a little bit larger than the prime
count. See if that affects the testing at all. Also, you could try
increasing the heap size a bit. I am interested to see how it changes things
for better or worse.

Dmitriy V'jukov

Aug 12, 2008, 6:25:29 AM

On Aug 12, 13:46, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> > 2 threads: Sergey(9656) -vs- Chris(2875) milliseconds
>
> > 10 threads: Sergey(39213) -vs- Chris(14563) milliseconds
>
> > This does not prove anything! I urge others to run the benchmarks an post
> > the results here please!
>
> I compile both with MSVC8 /Ox. Hardware is Intel Core 2 Quad 2.4GHz.
> And set N=10000, M=10000.
>
> 4 threads:
> std 254600
> ders 2578
> vzoom 2671 (std with defined THREAD_MALLOC_OVERLOAD_NEW_DELETE)
>
> 16 threads:
> std - don't even want to run
> ders 9500
> vzoom 10515

Well, it seems that comparison was not entirely fair.
First, by coincidence, the ders allocator was using the
vzoom allocator as its underlying allocator. I have now turned off the vzoom
allocator in the ders test.
Second, I created an STL-compatible allocator for vzoom.
Third, I turned off the debug checks related to startup in the vzoom allocator.
I also noticed 2 errors in the ders allocator. First, it doesn't align
memory! It just returns random addresses. Second, it returns NULL
when I allocate 0 bytes. I didn't fix them.

Here are the new results:
4 threads:
ders 2531
vzoom 2437

16 threads:
ders 9406
vzoom 9484

They look roughly equal. I think that alignment support will slow
ders down a bit. And note that vzoom is a MULTI-THREADED allocator,
while ders is single-threaded.
More important is that the vzoom approach can turn the ders allocator into
a multi-threaded allocator while PRESERVING its single-threaded performance.

Dmitriy V'jukov

Aug 12, 2008, 6:33:43 AM

On Aug 12, 14:28, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message
>
> news:1cb073c7-ab81-4552...@f36g2000hsa.googlegroups.com...

I set:
# define THREAD_ALLOCATOR_HEAP_SIZE (262144*4)
# define THREAD_ALLOCATOR_PRIME_COUNT (1+3)
# define THREAD_ALLOCATOR_MAX_COUNT (2+4)

Now results:
4 threads:
vzoom 2390

16 threads:
vzoom 9718

Note that I made some changes to the test:
http://groups.google.ru/group/comp.programming.threads/msg/2143c3a4344f25a7

Dmitriy V'jukov

Aug 12, 2008, 6:39:07 AM

On Aug 12, 14:08, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> Anyway, could you run test with 2 threads and 10 threads? Also, try a
> single-threaded run. And make sure to define NDEBUG.

2 threads:
ders 2312
vzoom 2359

10 threads:
ders 5843
vzoom 6093

Compiler options:
/Ox /Ob2 /Oi /Ot /Oy /GT /GL /D "WIN32" /D "NDEBUG" /D "_MBCS" /GF /
FD /EHsc /MT /GS- /arch:SSE2 /Zi /TP

Dmitriy V'jukov

Chris M. Thomasson

Aug 12, 2008, 6:49:31 AM

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:8afc7292-6d0e-427f...@r66g2000hsg.googlegroups.com...

On Aug 12, 13:46, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

[...]

> Well, it seems that comparison was not very honest.
> First, by concatenation of circumstances ders allocator was using
> vzoom allocator as underlying allocator. Now I turn off vzoom
> allocator on ders test.
> Second, I create stl compatible allocator for vzoom.
> Third, I turn off debug checks wrt sturtup in vzoom allocator.
> Also I notice 2 errors in ders allocator. First, it doesn't align
> memory! It just returns random addresses. Second, it returns NULL,
> when I allocate 0 bytes. I don't fix them.

The alignment code in the pre-alpha version of my region allocator is
suboptimal at best... See, I am using sizeof on the following union to
determine max alignment:

union aligner {
    char m_char;
    short m_short;
    int m_int;
    long m_long;
    double m_double;
    float m_float;
    aligner* m_this;
    void* m_ptr;
    void* (*m_fptr) (void*);
    std::size_t m_size_t;
    std::ptrdiff_t m_ptrdiff_t;
};

This clearly makes the max alignment bigger than it should be. I should do
this instead:

struct aligner_calc {
    char padder;
    aligner calc;
};

#define ALIGN_MAX offsetof(aligner_calc, calc)
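To see why the offsetof trick gives a tighter answer than sizeof, here is a small sketch of the same idea as a template. The names `align_probe`, `align_via_offsetof` and `three_doubles` are made up for this example, and it assumes a C++11 compiler so that the result can be cross-checked against `alignof`, which was not available to the code in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Probe in the style of aligner_calc: the leading char forces the compiler
// to insert just enough padding to align `value`.
template <typename T>
struct align_probe {
    char padder;
    T value;
};

// Alignment of T, measured as the offset of `value` inside the probe.
template <typename T>
constexpr std::size_t align_via_offsetof() {
    return offsetof(align_probe<T>, value);
}

// A type whose size (three doubles) is larger than its alignment, so
// sizeof would overestimate the alignment it needs.
struct three_doubles {
    double d[3];
};

// Sanity check comparing the probe against the built-in operator.
inline void check_alignment_probe() {
    assert(align_via_offsetof<three_doubles>() == alignof(three_doubles));
    assert(sizeof(three_doubles) > align_via_offsetof<three_doubles>());
}
```

The same reasoning applies to a union: its size is that of its largest member rounded up to its alignment, so sizeof can exceed the alignment actually required, and rounding every allocation up to it wastes memory.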

Also, I am using an expensive modulo operator in the following macro:

#define IS_ALIGNED(mp_this, mp_align) ( \
! (((std::ptrdiff_t const)((mp_this))) % \
((std::ptrdiff_t const)((mp_align)))) \
)

This macro is used every time something is freed via the region::from_ptr
function! NOT GOOD! I need to change the macro to something like:

#define IS_ALIGNED(mp_this, mp_type, mp_align) ( \
(mp_this) == ALIGN_SIZE(mp_this, mp_type, mp_align) \
)
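ALIGN_SIZE is not shown in this post, so the following is only a sketch of the usual way to avoid the modulo. It assumes the alignment is a power of two, and the helper names `is_pow2`, `align_up` and `is_aligned` are hypothetical, not the vzoom macros.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when `align` is a nonzero power of two.
constexpr bool is_pow2(std::uintptr_t align) {
    return align != 0 && (align & (align - 1)) == 0;
}

// Rounds `value` up to the next multiple of `align` using a mask instead
// of a division. Only valid for power-of-two alignments.
inline std::uintptr_t align_up(std::uintptr_t value, std::uintptr_t align) {
    assert(is_pow2(align));
    return (value + align - 1) & ~(align - 1);
}

// Alignment test with a single AND: the low bits must all be zero.
inline bool is_aligned(const void* p, std::uintptr_t align) {
    assert(is_pow2(align));
    return (reinterpret_cast<std::uintptr_t>(p) & (align - 1)) == 0;
}
```

For example, `align_up(13, 8)` yields 16, and an address is aligned exactly when `align_up` leaves it unchanged, which is the comparison the revised IS_ALIGNED macro expresses.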

That said, I noticed a bug in my allocator. It won't throw std::bad_alloc if
an allocation fails. Well, I guess that it's good that it's only version 0.0.0
pre-alpha!

Ouch!

Okay, I will go ahead and fix those issues, and repost the code.

BTW, thank you so much for your time Dmitriy! I really do appreciate all of
your help wrt this issue.

[...]

Chris M. Thomasson

Aug 12, 2008, 7:03:38 AM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:6Rdok.17763$1N1....@newsfe07.iad...

> "Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
> news:8afc7292-6d0e-427f...@r66g2000hsg.googlegroups.com...
> On Aug 12, 13:46, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
> [...]
>
>
>> Well, it seems that comparison was not very honest.
>> First, by concatenation of circumstances ders allocator was using
>> vzoom allocator as underlying allocator. Now I turn off vzoom
>> allocator on ders test.
>> Second, I create stl compatible allocator for vzoom.
>> Third, I turn off debug checks wrt sturtup in vzoom allocator.
>> Also I notice 2 errors in ders allocator. First, it doesn't align
>> memory! It just returns random addresses. Second, it returns NULL,
>> when I allocate 0 bytes. I don't fix them.
>
>
> the alignment code in the pre-alpha version of my region allocator is
> suboptimal at best... See, I am using sizeof on the following union to
> determine max alignment:

[...]

I fixed the alignment code:

http://webpages.charter.net/appcore/vzoom/malloc/sergey_vzmem_thread.html

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html

Now it's not wasting memory on every allocation!

> That said, I notice a bug in my allocator. It won't throw std::bad_alloc
> if an allocation fails. Well, I guess that its good that its only version
> 0.0.0 pre-alpha!

Stupid me! I forgot that the region::create function already throws
bad_alloc if allocation fails!

ARGHGH!

Chris M. Thomasson

Aug 12, 2008, 7:20:41 AM

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:93f3cb43-98ef-4f1a...@a70g2000hsh.googlegroups.com...

On Aug 12, 14:28, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message
>
> news:1cb073c7-ab81-4552...@f36g2000hsa.googlegroups.com...
> > On Aug 12, 10:20, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> >
> > > > 2 threads: Sergey(9656) -vs- Chris(2875) milliseconds
> >
> > > > 10 threads: Sergey(39213) -vs- Chris(14563) milliseconds
> >
> > > > This does not prove anything! I urge others to run the benchmarks an
> > > > post
> > > > the results here please!
> > > I compile both with MSVC8 /Ox. Hardware is Intel Core 2 Quad 2.4GHz.
> > > And set N=10000, M=10000.
> >
> > [...]
> >
> > One other request Dimitriy; In the follow section of my code:
> >

[...]

> >
> > Try increasing the prime count by 2 or 3, and increasing the max count
> > by 3
> > or 4; just make sure the max count is a little bit larger than the prime
> > count. See if that effects the testing at all. Also, you could try
> > increasing the head size a bit. I am interested to see how it changes
> > things
> > for better or worse.

> I set:
> # define THREAD_ALLOCATOR_HEAP_SIZE (262144*4)
> # define THREAD_ALLOCATOR_PRIME_COUNT (1+3)
> # define THREAD_ALLOCATOR_MAX_COUNT (2+4)

> Now results:
> 4 threads:
> vzoom 2390

> 16 threads:
> vzoom 9718

It makes a couple of hundred ms of improvement; not that bad... I wonder how
much correct alignment code would slow down the ders allocator; humm... When you
mentioned this alignment issue, I realized that my aligner code was fairly
crappy:

http://groups.google.com/group/comp.programming.threads/msg/24a5de7cb6c08614

therefore I fixed it:

http://groups.google.com/group/comp.programming.threads/msg/efb379890789a0e8

Then I thought, hey, mine at least tries to align things correctly, and ders
apparently does not... Well heck, that's not fair! Also, it seems like his
code simply does not like hyper-threading... Why? Well, I don't know.
Perhaps it is because of a thread stack alignment issue inherent in early
hyper-threading processors, which causes false sharing. Intel's fix was to
offset each thread's stack by an increasing amount. This is why AppCore contains
the following code in the file:

http://webpages.charter.net/appcore/appcore/src/ac_thread_c.html
(refer to last function in file...)

void* AC_CDECL
prv_thread_entry
( void *state )
{
  int ret;
  void *uret;
  ac_thread_t *_this = state;

  ret = pthread_setspecific
    ( g_tls_key,
      _this );
  if ( ret ) { assert( ! ret ); abort(); }

  if ( _this->id < 64 )
  {
    AC_OS_ALLOCA( 2048 * _this->id );
    uret = _this->fp_entry( (void*)_this->state );
  }

  else
  {
    uret = _this->fp_entry( (void*)_this->state );
  }

  return uret;
}

I use the Intel-recommended hack to get around the problem. Now, let me try to
find a link to documentation of the problem:

http://softwarecommunity.intel.com/Wiki/Multi-threadappsforMulti-core/487.htm

Bingo! The 64-k aliasing problem! Nasty bug. False-sharing sucks!

;^)
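The arithmetic behind the stagger can be sketched as follows. This is a simplified model, not AppCore code: it assumes AC_OS_ALLOCA behaves like alloca (it moves the working stack down by the requested number of bytes), that stacks grow downward, and that two addresses collide in the 64 KB sense when their low 16 bits match. The function names are made up.

```cpp
#include <cassert>
#include <cstdint>

// Two addresses "alias" in the 64 KB sense when they differ by a multiple
// of 64 KB, i.e. when their low 16 bits are identical.
inline bool aliases_64k(std::uintptr_t a, std::uintptr_t b) {
    return (a & 0xFFFFu) == (b & 0xFFFFu);
}

// Models the stagger in prv_thread_entry: a thread with id < 64 starts its
// real work 2048 * id bytes further down its stack.
inline std::uintptr_t staggered_stack_top(std::uintptr_t stack_top, unsigned id) {
    assert(stack_top >= 2048u * 64u);  // keep the subtraction from wrapping
    return id < 64 ? stack_top - 2048u * id : stack_top;
}
```

With thread stacks placed at, say, 1 MB intervals, the unstaggered tops all alias each other, while tops staggered by neighbouring ids do not. In this model ids that are 32 apart line up again, because 32 * 2048 is exactly 64 KB.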

> Note that I made some changes in test:
> http://groups.google.ru/group/comp.programming.threads/msg/2143c3a4344f25a7

Cool.

Sergey P. Derevyago

Aug 12, 2008, 7:27:24 AM

Chris M. Thomasson wrote:
> http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html
> but I cannot because it requires the ability of allocating in one thread
> and freeing in another...
>

Why don't you simply comment this stuff out?
Or does _every_ alloc/dealloc pair of this test cross the thread boundaries?

Sergey P. Derevyago

Aug 12, 2008, 7:36:50 AM

Dmitriy V'jukov wrote:
> Also I notice 2 errors in ders allocator. First, it doesn't align
> memory! It just returns random addresses.
>

Could you elaborate please?

The blocks are aligned to sizeof(void*)
http://ders.stml.net/cpp/mtprog/doc/mem__pool_8cpp-source.html

> Second, it returns NULL,
> when I allocate 0 bytes. I don't fix them.
>

The same URL:

void* mem_pool::allocate(size_t size)
{
assert(size>0);
...

Chris M. Thomasson

Aug 12, 2008, 7:51:43 AM

"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message

news:48a175d3$0$90274$1472...@news.sunsite.dk...

> Dmitriy V'jukov wrote:
>> Also I notice 2 errors in ders allocator. First, it doesn't align
>> memory! It just returns random addresses.
> >
> Could you elaborate please?
>
> The blocks are aligned to sizeof(void*)
> http://ders.stml.net/cpp/mtprog/doc/mem__pool_8cpp-source.html

Read this entire thread:

http://groups.google.com/group/comp.lang.c/browse_frm/thread/6fc4da438e08028b

Aligning to sizeof(void*) is simply insufficient... Free-function pointers, and
especially member-function pointers, might need different alignments. Also,
on my 32-bit x86 platform with MSVC and GCC, doubles need stricter alignment
than sizeof(void*); they need to be on 8-byte boundaries.

Also, run this program:

http://groups.google.com/group/comp.lang.c/msg/be7e0d0e97c5e1d9

on one of your systems and report the output. Here is what I happen to get:

(8) == ALIGN_MAX
(001EFF70) == rawbuf
(001EFF80) == l2cachebuf
(001F0000) == pagebuf
(001F0000) == superbuf

and sizeof(void*) is 4. Well, your allocator will cause undefined behavior
on my system.

>> Second, it returns NULL,
>> when I allocate 0 bytes. I don't fix them.
>>
> The same URL:
>
> void* mem_pool::allocate(size_t size)
> {
> assert(size>0);

You can legitimately pass zero to malloc. Here are the docs:

http://www.opengroup.org/onlinepubs/009695399/functions/malloc.html

An implementation can return NULL or a pointer to some unique block. Anyway,
it's undefined behavior if the user makes use of this block for anything other than
passing it back to free.

Chris M. Thomasson

Aug 12, 2008, 7:55:40 AM

"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message

news:48a1739d$0$90262$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html
>> but I cannot because it requires the ability of allocating in one thread
>> and freeing in another...
>>
> Why don't you simply comment these stuff out?
> Or _every_ alloc/dealloc pair of this test crosses the thread boundaries?

It's explicitly designed to generate frees across thread boundaries; I did
this to test the scalability of the allocator under load. I think your test
works fine for a pure thread-local test. But I have always been interested
in creating an allocator with great single-thread, and scalable multi-thread,
performance characteristics.
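For readers unfamiliar with the problem, here is one common way to let a block be freed by a thread other than the one that allocated it while keeping the owner's fast path free of atomics. This is a generic sketch, not the vzoom implementation: the types `block` and `pool`, the fixed 64-byte payload, and all function names are assumptions. It also ignores thread exit and never returns memory to the system.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

struct pool;

// Fixed-size block with a header naming the pool (thread) that created it.
struct block {
    pool*  owner;
    block* next;
    alignas(std::max_align_t) unsigned char payload[64];
};

struct pool {
    block* local_free = nullptr;               // owner thread only, no atomics
    std::atomic<block*> remote_free{nullptr};  // other threads push here

    void* allocate() {
        if (!local_free)  // local list empty: take everything freed remotely
            local_free = remote_free.exchange(nullptr, std::memory_order_acquire);
        block* b = local_free;
        if (b) {
            local_free = b->next;
        } else {
            b = new block;
            b->owner = this;
        }
        return b->payload;
    }

    void free_local(block* b) { b->next = local_free; local_free = b; }

    void free_remote(block* b) {  // lock-free push onto the owner's stack
        block* head = remote_free.load(std::memory_order_relaxed);
        do {
            b->next = head;
        } while (!remote_free.compare_exchange_weak(
            head, b, std::memory_order_release, std::memory_order_relaxed));
    }
};

inline pool& this_thread_pool() {
    thread_local pool p;
    return p;
}

void* pool_alloc() { return this_thread_pool().allocate(); }

void pool_free(void* p) {
    assert(p != nullptr);
    block* b = reinterpret_cast<block*>(
        static_cast<unsigned char*>(p) - offsetof(block, payload));
    pool* mine = &this_thread_pool();
    if (b->owner == mine) mine->free_local(b);
    else b->owner->free_remote(b);
}

// Demo helper: frees p from a freshly started thread and waits for it.
void free_on_other_thread(void* p) {
    std::thread t([p] { pool_free(p); });
    t.join();
}
```

Because remote threads only ever push and the owner only ever takes the whole list with one exchange, the remote stack avoids the ABA hazard of a general lock-free pop. A purely thread-local benchmark such as the list test never touches the atomic path at all.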

Dmitriy V'jukov

Aug 12, 2008, 8:05:31 AM

On Aug 12, 15:36, "Sergey P. Derevyago" <non-exist...@iobox.com>
wrote:

> Dmitriy V'jukov wrote:
> > Also I notice 2 errors in ders allocator. First, it doesn't align
> > memory! It just returns random addresses.
>
> >
> Could you elaborate please?
>
> The blocks are aligned to sizeof(void*)http://ders.stml.net/cpp/mtprog/doc/mem__pool_8cpp-source.html

A general-purpose memory allocator must return an address with the MOST
STRINGENT alignment, not just sizeof(void*).

See ISO C++03:
3.8/1:
The lifetime of an object is a runtime property of the object. The
lifetime of an object of type T begins when:
-- storage with the proper alignment and size for type T is obtained,
and...

3.9/5:
Object types have alignment requirements (3.9.1, 3.9.2). The alignment
of a complete object type is an implementation-defined integer value
representing a number of bytes; an object is allocated at an address
that meets the alignment requirements of its object type.

5.3.4/10:
A new-expression passes the amount of space requested to the
allocation function as the first argument of type std::size_t. That
argument shall be no less than the size of the object being created;
it may be greater than the size of the object being created only if
the object is an array. For arrays of char and unsigned char, the
difference between the result of the new-expression and the address
returned by the allocation function shall be an integral multiple of
the most stringent alignment requirement (3.9) of any object type
whose size is no greater than the size of the array being created.
[Note: Because allocation functions are assumed to return pointers to
storage that is appropriately aligned for objects of any type, this
constraint on array allocation overhead permits the common idiom of
allocating character arrays into which objects of other types will
later be placed. ]

3.7.3.1/2:
The pointer returned shall be suitably aligned so that it can be
converted to a pointer of any complete object type and then used to
access the object or array in the storage allocated...

> > Second, it returns NULL,
> > when I allocate 0 bytes. I don't fix them.
>
> The same URL:
>
> void* mem_pool::allocate(size_t size)
> {
> assert(size>0);

The user CAN allocate 0 bytes. And the allocator MUST return a unique address
anyway.

3.7.3.1/2 Allocation functions:
Even if the size of the space requested is zero, the request can fail.
If the request succeeds, the value returned shall be a nonnull pointer
value (4.10) p0 different from any previously returned value p1.

I am quite surprised to see a C++ memory allocator which doesn't satisfy
those requirements...

Dmitriy V'jukov

Chris M. Thomasson

Aug 12, 2008, 8:23:04 AM

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:2af32de6-bfae-4166...@m3g2000hsc.googlegroups.com...

On Aug 12, 15:36, "Sergey P. Derevyago" <non-exist...@iobox.com>
wrote:
> > Dmitriy V'jukov wrote:
> > > Also I notice 2 errors in ders allocator. First, it doesn't align
> > > memory! It just returns random addresses.
> >
> > >
> > Could you elaborate please?
> >
> > The blocks are aligned to
> > sizeof(void*)http://ders.stml.net/cpp/mtprog/doc/mem__pool_8cpp-source.html

> General purpose memory allocator must return address with the MOST
> STRINGENT alignment, not just sizeof(void*).

[...]

> User CAN allocate 0 bytes. And allocator MUST return unique address
> anyway.

> 3.7.3.1/2 Allocation functions:
> Even if the size of the space requested is zero, the request can fail.
> If the request succeeds, the value returned shall be a nonnull pointer
> value (4.10) p0 different from any previously returned value p1.

AFAICT, the above says that malloc can respond to a request of zero bytes in
one of two ways...

It can "pretend" it failed by returning NULL, or it can return a non-NULL
address which can be passed back to free. This is implementation-defined.
Although, I believe that the only time it should return NULL is when the
allocator is truly out of memory; otherwise it could trick the program into
thinking it's under an out-of-memory condition when it's really not. Also,
any code which does something like:

char* ptr = malloc(0);
if (ptr) {
    free(ptr);
}

is 100% conforming. Although, a program that does this:

char* ptr = malloc(0);
if (ptr) {
    ptr[0] = 'a';
    free(ptr);
}

results in undefined behavior because, according to the allocation, the
non-null `ptr' variable points to a chunk of "something" in the allocator
that is effectively zero bytes long, which means that any mutations will
cause demons to fly out of the programmer's nose or something... So, AFAICT,
the following should be legit:

void* malloc(size_t sz) {
    if (! sz) {
        static char zero_bytes;
        return &zero_bytes;
    }
    [...];
    return [...];
}

> I am quite surprised to see C++ memory allocator which doesn't satisfy
> those requirements...

yeah. Well, I would overlook it if the code was version 0.0.0 pre-alpha...

;^D

Dmitriy V'jukov

Aug 12, 2008, 8:20:42 AM

On Aug 12, 15:51, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> aligning to sizeof(void*) is simply insufficient... free function, and
> especially member function pointers, might need different alignments. Also,
> on my 32-bit x86 platform with MSVC and GCC, doubles need stricter alignment
> than sizeof(void*), they need to on 8-byte boundaries.
>
> Also, run this program:
>
> http://groups.google.com/group/comp.lang.c/msg/be7e0d0e97c5e1d9
>
> on one of your systems and report the output. Here is what I happen to get:
>
> (8) == ALIGN_MAX
> (001EFF70) == rawbuf
> (001EFF80) == l2cachebuf
> (001F0000) == pagebuf
> (001F0000) == superbuf
>
> and sizeof(void*) is 4. Well, your allocator will cause undefined behavior
> on my system.

You forget about platform-specific stuff. For example, x86 has SSE2,
and data for most SSE2 instructions MUST be aligned on 16 bytes
(otherwise you will get a hardware exception). On such a platform the user is
free to allocate memory with a custom memory allocator and then pass it
to an SSE2 instruction, and he will be surprised if he gets a hardware
exception. Standard malloc() on x86 allocates memory exactly with
alignment 16.
So, to be production-ready you have to extend your 'union aligner'
with ifdefs and platform-specific stuff.

> You can legitimately pass zero to malloc. Here is docs:
>
> http://www.opengroup.org/onlinepubs/009695399/functions/malloc.html
>
> An implementation can return NULL or some pointer to unique block. Anyway,
> its undefined behavior if user makes use of this block for anything else but
> passing it back to free.

The user can compare the address of such a block with another block for
equality.
For example: a server accepts requests. Some requests have associated
data and some don't. For the latter requests you can allocate a
zero-sized block of memory, and then use something like this:
if (request_data_1 == request_data_2) ...
when searching the list of requests. Even if there are 2 requests
without associated data, their addresses MUST BE different.

Dmitriy V'jukov

Aug 12, 2008, 8:33:54 AM

On Aug 12, 16:23, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > User CAN allocate 0 bytes. And allocator MUST return unique address
> > anyway.
> > 3.7.3.1/2 Allocation functions:
> > Even if the size of the space requested is zero, the request can fail.
> > If the request succeeds, the value returned shall be a nonnull pointer
> > value (4.10) p0 different from any previously returned value p1.
>
> AFAICT, the above says that malloc can respond to a request of zero bytes in
> one of two ways...
>
> It can "pretend" it failed by returning NULL, or it can return non-NULL
> address which can be passed back to free. This is implementation defined.
> Although, I believe that the only time it returns NULL is if the allocator
> was truly out of memory, or else it could trick the program into thinking
> its under an out-of-memory condition, when its really not.

Yeah, I would be very surprised if an implementation did so.
But a custom memory allocator is free to do so, provided it clearly
documents this behavior. Then the user just has to watch out not to
request 0 bytes. But if the implementation at the same time overloads global
operator new/malloc, then it can affect the stdlib implementation, which
is not ready for such behavior.

> Also, any code
> which does something like:
>
> char* ptr = malloc(0);
> if (ptr) {
> free(ptr);
>
> }
>
> if 100% conforming. Although, a program that does this:
>
> char* ptr = malloc(0);
> if (ptr) {
> ptr[0] = 'a';
> free(ptr);
>
> }
>
> results in undefined behavior because according to the allocation, the
> non-null `ptr' variable points to a chunk of "something" in the allocator
> that is effectively zero-bytes long which means that any mutations will
> cause demons to fly out of the programmers nose or something... So, AFAICT,
> the following should be legit:
>
> void* malloc(size_t sz) {
> if (! sz) {
> static char zero_bytes;
> return &zero_bytes;
> }
> [...];
> return [...];
> }

NO! "the value returned shall be a nonnull pointer value (4.10) p0
different from any previously returned value p1"! Even if I allocate 0
bytes, I MUST have some "object identity" (unique address).
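One simple way to get that object identity, sketched here on top of the host malloc rather than any of the allocators in this thread, is to round a zero-byte request up to one byte so every successful call hands back its own live block. The names `alloc_unique` and `free_unique` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Rounds a zero-byte request up to one byte, so the underlying allocator
// returns a distinct block for every successful call.
void* alloc_unique(std::size_t size) {
    return std::malloc(size != 0 ? size : 1);
}

void free_unique(void* p) { std::free(p); }

// Two zero-size allocations that are alive at the same time must compare
// unequal, which is what makes them usable as identities.
bool zero_size_blocks_are_distinct() {
    void* p0 = alloc_unique(0);
    void* p1 = alloc_unique(0);
    assert(p0 && p1);  // assumes the host malloc did not run out of memory
    bool distinct = (p0 != p1);
    free_unique(p0);
    free_unique(p1);
    return distinct;
}
```

The shared static byte from the quoted malloc sketch fails exactly this check, since every zero-size request would return the same address.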

Dmitriy V'jukov

Sergey P. Derevyago

Aug 12, 2008, 9:01:37 AM

Chris M. Thomasson wrote:
> and sizeof(void*) is 4. Well, your allocator will cause undefined
> behavior on my system.
>

sizeof(void*) was deliberately chosen for this _tutorial_ as it's
sufficient for widely used platforms.
Could you please list important REAL-world platforms that require more
than sizeof(void*)?

>> void* mem_pool::allocate(size_t size)
>> {
>> assert(size>0);
>
> You can legitimately pass zero to malloc. Here is docs:
>

mem_pool isn't malloc. The preconditions differ.

Sergey P. Derevyago

Aug 12, 2008, 9:12:20 AM

Dmitriy V'jukov wrote:
> General purpose memory allocator must return address with the MOST
> STRINGENT alignment, not just sizeof(void*).
>

Please, see my reply to Chris.

Chris M. Thomasson

Aug 12, 2008, 9:18:31 AM

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:7b491534-c936-4dad...@m44g2000hsc.googlegroups.com...

You're right! Thank you. Sure enough, when I include <xmmintrin.h> and add the
data-type `__m128' to the `union aligner', the `REGION_ALIGN_MAX' expands to
16 instead of 8. What do you think is the most elegant way to handle this...
Humm. I REALLY need to think here. Basically, if a user wants to use the
current commercial version of the vzoom slab allocator with SSE she/he would
have to use a special function called:

void* vz_malloc_aligned(void** pbasemem, size_t size, size_t align);

and use it like:

void* sse_base_mem;
__m128* sse_align_mem = vz_malloc_aligned(&sse_base_mem, sizeof(__m128), 16);
if (sse_align_mem) {
    [use sse_align_mem];
    vz_free(sse_base_mem);
}
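The commercial vz_malloc_aligned is not shown anywhere in this thread, so the following is only a guess at how a function with that shape could work: over-allocate by the worst-case slack, round the address up, and hand the original pointer back through the out-parameter for the later free. `malloc_aligned_sketch` is a made-up name, it sits on the host malloc, and it assumes `align` is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Returns an address aligned to `align` and stores in *pbasemem the pointer
// that must later be released. Returns a null pointer on failure.
void* malloc_aligned_sketch(void** pbasemem, std::size_t size, std::size_t align) {
    assert(pbasemem != nullptr);
    assert(align != 0 && (align & (align - 1)) == 0);  // power of two
    *pbasemem = nullptr;
    if (size > SIZE_MAX - (align - 1)) return nullptr;  // size + slack overflows
    void* base = std::malloc(size + align - 1);         // worst-case slack
    if (!base) return nullptr;
    std::uintptr_t raw = reinterpret_cast<std::uintptr_t>(base);
    std::uintptr_t mask = static_cast<std::uintptr_t>(align) - 1;
    std::uintptr_t aligned = (raw + mask) & ~mask;       // round up
    *pbasemem = base;
    return reinterpret_cast<void*>(aligned);
}
```

Usage mirrors the snippet above: keep the base pointer around and pass that, not the aligned pointer, to the matching free function (here `std::free`).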

So far, I have received no complaints. Humm... What do you think? Should I
just add an analogous function to the vzoom region allocator? Or should I
perform the per-arch #ifdefs? Something like:

#ifdef VZBUILD_COMPILER_MSVC
# ifdef VZBUILD_ARCH_X86_SSE
# include <xmmintrin.h>
/* __m128 rather than char, so the union picks up 16-byte alignment; a fixed
   member name is used because __LINE__ is not expanded next to ## */
# define VZBUILD_ARCH_X86_SSE_TYPE __m128 m_sse
# endif
#endif

#ifndef VZBUILD_ARCH_X86_SSE_TYPE
# define VZBUILD_ARCH_X86_SSE_TYPE
#endif

#define VZBUILD_MALLOC_ALIGNER_EXTRA() VZBUILD_ARCH_X86_SSE_TYPE

That would work. Also, I could do:

# ifdef VZBUILD_ARCH_X86_SSE
# define VZBUILD_ARCH_X86_SSE_TYPE VZBUILD_DECLSPEC_ALIGN(16) char m_sse
# endif

#ifndef VZBUILD_ARCH_X86_SSE_TYPE
# define VZBUILD_ARCH_X86_SSE_TYPE
#endif

#define VZBUILD_MALLOC_ALIGNER_EXTRA() VZBUILD_ARCH_X86_SSE_TYPE

This could be extendable to different types... like:

# ifdef VZBUILD_ARCH_X86_SSE
# define VZBUILD_ARCH_X86_SSE_TYPE VZBUILD_DECLSPEC_ALIGN(16) char m_ ## __LINE__
# endif

# ifdef VZBUILD_ARCH_WHATEVER_SPECIAL
# define VZBUILD_ARCH_WHATEVER_SPECIAL_TYPE VZBUILD_DECLSPEC_ALIGN(128) char m_ ## __LINE__
# endif

#ifndef VZBUILD_ARCH_X86_SSE_TYPE
# define VZBUILD_ARCH_X86_SSE_TYPE
#endif

#ifndef VZBUILD_ARCH_WHATEVER_SPECIAL_TYPE
# define VZBUILD_ARCH_WHATEVER_SPECIAL_TYPE
#endif

#define VZBUILD_MALLOC_ALIGNER_EXTRA() \
VZBUILD_ARCH_X86_SSE_TYPE ; \
VZBUILD_ARCH_WHATEVER_SPECIAL_TYPE

Finally, after all of the #ifdef MESS, the aligner union can be as follows:

union aligner {
  char m_char;
  short m_short;
  int m_int;
  long m_long;
  double m_double;
  float m_float;
  aligner* m_this;
  void* m_ptr;
  void* (*m_fptr) (void*);
  std::size_t m_size_t;
  std::ptrdiff_t m_ptrdiff_t;

  VZBUILD_MALLOC_ALIGNER_EXTRA();
};

Which method do you like the best? Theoretically, the
`VZBUILD_MALLOC_ALIGNER_EXTRA' macro could contain special type alignments
for every platform vzoom supports.
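
To illustrate why adding one member is enough to change the computed maximum, here is a small sketch using C++11 `alignof`, which was not available when this was written. The union names and the `vec16` type are hypothetical; `vec16` only stands in for `__m128` so the sketch compiles without <xmmintrin.h>. How `REGION_ALIGN_MAX' is really derived is not shown in this thread.

```cpp
#include <cstddef>

// Stand-in for a 16-byte SIMD type such as __m128 (an assumption, chosen so
// the sketch does not depend on any platform header).
struct alignas(16) vec16 { unsigned char bytes[16]; };

// Roughly the portable part of the aligner union.
union aligner_basic {
    char m_char;
    long m_long;
    double m_double;
    void* m_ptr;
    void* (*m_fptr)(void*);
    std::size_t m_size_t;
};

// The same union with one platform-specific member added.
union aligner_simd {
    char m_char;
    long m_long;
    double m_double;
    void* m_ptr;
    void* (*m_fptr)(void*);
    std::size_t m_size_t;
    vec16 m_simd;  // raises the alignment of the whole union
};

// A union is aligned at least as strictly as its most demanding member,
// so the allocator's maximum can be read straight off the union type.
constexpr std::size_t kAlignBasic = alignof(aligner_basic);
constexpr std::size_t kAlignSimd = alignof(aligner_simd);

static_assert(kAlignSimd >= 16, "SIMD member must raise the alignment");
static_assert(kAlignSimd >= kAlignBasic, "extra member never lowers it");
```

The drawback discussed in this thread still applies: the union only covers the types somebody remembered to list.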

One more thing... I need to add a member function type in the aligner... I
forgot to! ARGH.

;^(...

> > You can legitimately pass zero to malloc. Here is docs:
> >
> > http://www.opengroup.org/onlinepubs/009695399/functions/malloc.html
> >
> > An implementation can return NULL or some pointer to unique block.
> > Anyway, its undefined behavior if user makes use of this block for
> > anything else but passing it back to free.

> User can compare address of such block with another block for
> equality.
> For example. Server accepts requests. Some requests have associated
> data and some don't have. For latter requests you can allocate zero-
> sized block of memory, and then use something like this:
> if (request_data_1 == request_data_2) ...
> when you searching list of requests. Even if there are 2 requests
> without associated data, their addresses MUST BE different.

Great point. This is a bug in my region allocator as well. Luckily, the fix
is very trivial indeed! Here is what I need to do. In the following
function:
___________________________________________________________________________
void* allocate(std::size_t sz) throw(std::bad_alloc) {
0:  assert(m_startup);
1:  startup();
2:  if (! m_startup) {
3:    std::printf("STATIC STARTUP ORDER ERROR!!!!!!!!!!\n");
    }
4:  sz = ALIGN_SIZE(sz, std::size_t, REGION_ALIGN_MAX);
    if (sz <= T_heap_size) {
      region* node = m_head;
      while (node) {
        void* const ptr = node->allocate_local(sz);
        if (ptr) {
          if (node != m_head) {
            if (! node->is_full()) {
              promote(node);
            }
          }
          return ptr;
        }
        node = node->m_next;
      }
      return expand()->allocate_local(sz);
    }
    return NULL;
}
___________________________________________________________________________

I need to change line 4 to something like:

sz = ALIGN_SIZE((! sz) ? 1 : sz, std::size_t, REGION_ALIGN_MAX);

this would allow the following to never fail:

{
  tls_malloc.startup();
  void* a = tls_malloc.allocate(0);
  void* b = tls_malloc.allocate(0);
  assert(a != b);
  tls_malloc.deallocate(b);
  tls_malloc.deallocate(a);
  tls_malloc.shutdown();
}
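
The size arithmetic behind that one-line change can be sketched in isolation. The definition of `ALIGN_SIZE' is not shown in this thread, so the function below is a hypothetical equivalent that assumes a power-of-two alignment; `round_request` is an illustrative name, not part of the allocator.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical equivalent of the patched line 4: bump a zero-byte request
// to one byte, then round up to a multiple of `align` (a power of two).
std::size_t round_request(std::size_t sz, std::size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    if (sz == 0) sz = 1;  // a zero-byte request still consumes one slot
    return (sz + align - 1) & ~(align - 1);
}
```

With an alignment of 8, a request for 0 bytes becomes 8 bytes, so two zero-byte allocations occupy two different slots and compare unequal. Without the bump the rounded size stays 0, and both calls can hand back the same address.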

Thanks again Dmitriy!!!! I am applying fixes and will post them here
shortly.

Chris M. Thomasson

Aug 12, 2008, 9:19:47 AM

"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message

news:48a189b2$0$90272$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> and sizeof(void*) is 4. Well, your allocator will cause undefined
>> behavior on my system.
>>
> sizeof(void*) was deliberately chosen for this _tutorial_ as it's
> sufficient for widely used platforms.
> Could you please list important REAL-world platforms that require more
> than sizeof(void*)?

The type `double' on the platform I happen to be using right now requires an
alignment of 8. sizeof(void*) is 4, so it's not going to work.

Chris M. Thomasson

Aug 12, 2008, 9:23:07 AM

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:d5c29daa-d54d-454d...@x41g2000hsb.googlegroups.com...

Good point.

YIKES! STUPID ME!!!

:^o

I fixed this in the region allocator. It rounds zero-byte allocations up to 1.
That way, every zero-byte request will be unique, unless an out-of-memory
condition is hit, in which case it can return NULL, which will trigger a
bad_alloc exception.

Chris M. Thomasson

Aug 12, 2008, 9:28:09 AM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:O0gok.12823$3l5....@newsfe06.iad...

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ummm. That is supposed to be:

# define VZBUILD_ARCH_X86_SSE_TYPE __m128 m_ ## __LINE__

of course!!!

> # endif
> #endif
>
> #ifndef VZBUILD_ARCH_X86_SSE_TYPE
> # define VZBUILD_ARCH_X86_SSE_TYPE
> #endif
>
> #define VZBUILD_MALLOC_ALIGNER_EXTRA() VZBUILD_ARCH_X86_SSE_TYPE

[...]

Sergey P. Derevyago

Aug 12, 2008, 9:29:09 AM

Chris M. Thomasson wrote:
> the type `double' on my platform I happen to be using right now requires
> an alignment of 8. sizeof(void*) is 4, so its not going to work.
>

I don't think so: mp_impl::format_new_chunk() gets the memory using
operator new() and splits it into the blocks of the requested size
rounded to sizeof(void*).
I.e. sizeof(double) will be rounded to 2*sizeof(void*) which is exactly 8.

Chris M. Thomasson

Aug 12, 2008, 10:21:00 AM

"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message

news:48a19026$0$90275$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> the type `double' on my platform I happen to be using right now requires
>> an alignment of 8. sizeof(void*) is 4, so its not going to work.
> >
> I don't think so: mp_impl::format_new_chunk() gets the memory using
> operator new() and splits it into the blocks of the requested size rounded
> to sizeof(void*).

> I.e. sizeof(double) will be rounded to 2*sizeof(void*) which is exactly 8.

Where are you getting the 2 from? Is it sizeof(double) / sizeof(void*)? I
still don't think you're doing alignment correctly. What am I missing? Why
does the following GCC program not work on my platform as-is?
________________________________________________________________
#include <cstdio>
#include <cstddef>
#include <list>
#include <ders/stl_alloc.hpp>

typedef __attribute__((aligned(16))) unsigned char __m128[16];

#define IS_ALIGNED(mp_this, mp_align) ( \
  ! (((std::ptrdiff_t)(mp_this)) % (mp_align)) \
)

int main() {
  ders::mem_pool mp;

  double* d1 = (double*)mp.allocate(sizeof(*d1));
  __m128* sse1 = (__m128*)mp.allocate(sizeof(*sse1));
  double* d2 = (double*)mp.allocate(sizeof(*d2));
  __m128* sse2 = (__m128*)mp.allocate(sizeof(*sse2));

  if (! IS_ALIGNED(d1, 8) ||
      ! IS_ALIGNED(d2, 8)) {
    std::puts("double not aligned!");
  }

  if (! IS_ALIGNED(sse1, 16) ||
      ! IS_ALIGNED(sse2, 16)) {
    std::puts("sse not aligned!");
  }

  mp.deallocate(sse2, sizeof(*sse2));
  mp.deallocate(d2, sizeof(*d2));
  mp.deallocate(sse1, sizeof(*sse1));
  mp.deallocate(d1, sizeof(*d1));

  return 0;
}
________________________________________________________________

I get an output of

sse not aligned!

This will segfault if I try to use your allocator and SSE... AFAICT, using
SSE is real-world. I need to fix my allocator to handle this as well. Also,
why am I required to include the size of an allocation when I call your
deallocate function? This can be a nuisance, and will not work for some
programs. You're not returning pointers with the strictest alignment...
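
One possible reading of this result is that rounding the block size fixes the spacing between blocks but not where the first block starts. The sketch below is only an arithmetic model, not mem_pool itself: it assumes blocks are carved back to back from a chunk, and that the chunk base is 8-byte aligned but not 16-byte aligned. Both are assumptions, as are the function names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Round `n` up to a multiple of `unit` (unit > 0), the way a block size
// would be rounded to a multiple of sizeof(void*).
std::size_t round_up(std::size_t n, std::size_t unit) {
    assert(unit != 0);
    return (n + unit - 1) / unit * unit;
}

// Count how many of the first `count` blocks carved from a chunk starting at
// address `base` land on an `align`-byte boundary. `base` is just a number
// here, so no memory is touched.
std::size_t aligned_blocks(std::uintptr_t base, std::size_t block_size,
                           std::size_t count, std::size_t align) {
    assert(align != 0);
    std::size_t hits = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if ((base + i * block_size) % align == 0) ++hits;
    }
    return hits;
}
```

In this model, with a chunk base of 0x1008, all four 8-byte blocks are 8-byte aligned, but none of four 16-byte blocks is 16-byte aligned, since every address is 8 past a 16-byte boundary. Rounding the size to a multiple of sizeof(void*) cannot make up for a base that is less aligned than the type requires.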

Chris M. Thomasson

Aug 12, 2008, 10:29:58 AM

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:7b491534-c936-4dad...@m44g2000hsc.googlegroups.com...

Ummm.. Well, on my MSVC 8, the following program outputs OH CRAP! which
means that the standard allocator does not return pointers aligned on a
boundary of 16!:
_________________________________________________________________
#include <xmmintrin.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

#define IS_ALIGNED(mp_this, mp_align) ( \
  ! (((std::ptrdiff_t)(mp_this)) % (mp_align)) \
)

int main() {
  __m128* sse1 = (__m128*)std::malloc(sizeof(*sse1));
  __m128* sse2 = (__m128*)std::malloc(sizeof(*sse2));
  __m128* sse3 = (__m128*)std::malloc(sizeof(*sse3));

  if (! IS_ALIGNED(sse1, 16) ||
      ! IS_ALIGNED(sse2, 16) ||
      ! IS_ALIGNED(sse3, 16)) {
    std::puts("OH CRAP!");
  }
  std::free(sse3);
  std::free(sse2);
  std::free(sse1);
  return 0;
}
_________________________________________________________________

Run the program and tell me what you get.

> So, to be production-ready you have to extend your 'union aligner'
> with ifdefs and platform-specific stuff.

Maybe not. The standard allocator on my copy of MSVC does not work for SSE!

:^O

[...]

Sergey P. Derevyago

Aug 12, 2008, 10:36:19 AM

Chris M. Thomasson wrote:
> This will segfault if I try to use your allocator and SSE... AFAICT,
> using sse is real-world. I need to fix my allocator to handle this as
> well.
>

SSE is out of scope of this particular tutorial.

> Also, why am I required to include the size of an allocation when
> I call your deallocate function?
>

This is a C++-specific question.
The point is that it helps to create really fast C++ (class)
allocators...

Moreover, http://ders.stml.net/cpp/mtprog/doc/destroy_8hpp-source.html
shows some other non-trivial C++ tricks so you need certain C++
background to grok what's going on...

Chris M. Thomasson

Aug 12, 2008, 10:42:27 AM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:mXgok.19945$KZ....@newsfe03.iad...

> "Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message
> news:48a19026$0$90275$1472...@news.sunsite.dk...
>> Chris M. Thomasson wrote:
>>> the type `double' on my platform I happen to be using right now requires
>>> an alignment of 8. sizeof(void*) is 4, so its not going to work.
>> >
>> I don't think so: mp_impl::format_new_chunk() gets the memory using
>> operator new() and splits it into the blocks of the requested size
>> rounded to sizeof(void*).
>
>> I.e. sizeof(double) will be rounded to 2*sizeof(void*) which is exactly
>> 8.
>
> Where are you getting the 2 from? Is it sizeof(double) / sizeof(void*)? I
> still don't think your doing alignment correctly. What am I missing? Why
> does the following GCC program not work on my platform as-is:
> ________________________________________________________________

[...]

> ________________________________________________________________
>
>
> I get an output of
>
> sse not aligned!
>
>
> This will segfault if I try to use your allocator and SSE... AFAICT, using
> sse is real-world. I need to fix my allocator to handle this as well.
> Also, why am I required to include the size of an allocation when I call
> your deallocate function? This can be a nuisance, and not work for some
> programs. You not returning pointers with strictest alignment...

Ummm... Don't worry about this too much because, well, the standard
allocator (e.g., malloc) did not align for SSE either. I think a user is
going to have to do it manually. One other point: no matter how hard we try
to get alignment code to work, it's still not going to be perfect. There are
some conforming platforms that will require weird alignments that our
allocators will not be able to cope with no matter how hard we try... I can
get my region allocator to conform to SSE by adding the `__m128' data-type
to the aligner union, but that's x86-specific. There will be some other
platform that will have something that the aligner union simply does not
include.

Chris M. Thomasson

Aug 12, 2008, 10:47:55 AM

"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message

news:48a19fe4$0$90272$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> This will segfault if I try to use your allocator and SSE... AFAICT,
>> using sse is real-world. I need to fix my allocator to handle this as
>> well.
> >
> SSE is out of scope of this particular tutorial.

Apparently SSE is out of scope for the system-provided allocator as well!!

;^)

> > Also, why am I required to include the size of an allocation when
>> I call your deallocate function?
> >
> This is a C++-specific question.
> The point is that it helps to create really fast C++ (class)
> allocators...

Completely agreed. It's a huge help to know the size when you're freeing
memory.

> Moreover, http://ders.stml.net/cpp/mtprog/doc/destroy_8hpp-source.html
> shows some other non-trivial C++ tricks so you need certain C++ background
> to grok what's going on...

I get it. I am more of a C guy, and my region allocator was specifically
designed to overload global operator new/delete. I don't think you can do a
global delete operator with an additional size parameter. Am I right?

Sergey P. Derevyago

Aug 12, 2008, 12:57:47 PM

Chris M. Thomasson wrote:
> I don't think you can
> do a global delete operator with an additional size parameter. Am I right?
>

No, you can't :(
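
The class-level case mentioned earlier in the thread is different: a member operator delete may take the size as a second parameter, and the compiler fills it in. Here is a minimal sketch of that form. The `Widget` class and the `g_last_freed_size` variable are hypothetical names used only for illustration, and malloc/free stand in for a real pool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Size most recently handed to Widget::operator delete (illustration only).
static std::size_t g_last_freed_size = 0;

struct Widget {
    int payload[4];

    static void* operator new(std::size_t sz) {
        assert(sz != 0);
        void* p = std::malloc(sz);  // a real pool would pick a free list by sz
        if (p == nullptr) throw std::bad_alloc();
        return p;
    }

    // Member form with a size parameter: the compiler supplies the size,
    // so a pool can find the right free list without a per-block header.
    static void operator delete(void* p, std::size_t sz) noexcept {
        g_last_freed_size = sz;
        std::free(p);
    }
};
```

After `delete new Widget;` the recorded size equals `sizeof(Widget)`. That is the information a size-taking deallocate function needs, delivered without the caller having to track it by hand.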

Chris M. Thomasson

Aug 15, 2008, 8:19:38 AM

"Sergey P. Derevyago" <non-ex...@iobox.com> wrote in message

news:48a1c10c$0$90264$1472...@news.sunsite.dk...

> Chris M. Thomasson wrote:
>> I don't think you can do a global delete operator with an additional size
>> parameter. Am I right?
> >
> No, you can't :(

;^(...

Dmitriy V'jukov

Aug 15, 2008, 8:55:57 AM

On Aug 12, 6:29 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message
>
> news:7b491534-c936-4dad...@m44g2000hsc.googlegroups.com...

Yes, you are right.
They just write: "The storage space pointed to by the return value is
guaranteed to be suitably aligned for storage of any type of object".
But they don't specify what "any type of object" means.

So I think it's okay for vzoom to use 8-byte alignment by default
and provide vz_malloc_aligned().
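
Later C++ standards made this boundary explicit. C++11 introduced `std::max_align_t`, and anything stricter than its alignment counts as over-aligned and is not covered by the default allocator's guarantee. The sketch below uses those C++11 facilities; both helper names are hypothetical and not taken from vzoom or mem_pool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when `p` sits on an `align`-byte boundary (align must be a power of two).
bool is_aligned(const void* p, std::size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (align - 1)) == 0;
}

// True when a plain malloc / operator new result is already guaranteed to be
// aligned enough for `align`, going by the C++11 notion of fundamental
// alignment. Stricter requests need a dedicated aligned-allocation function.
bool default_allocator_covers(std::size_t align) {
    return align <= alignof(std::max_align_t);
}
```

An allocator could use a check like `default_allocator_covers` to decide between its default path and a dedicated aligned path such as vz_malloc_aligned().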

Dmitriy V'jukov